dpcluster Package

algorithms Module

class dpcluster.algorithms.OnlineVDP(distr, w=0.1, k=25, tol=0.001, max_items=100)[source]

Experimental online clustering algorithm.

Parameters:
  • distr – likelihood-prior distribution pair governing clusters. For now the only option is to use an instance of dpcluster.distributions.GaussianNIW.
  • w – non-negative prior weight. The prior has as much influence as w data points.
  • k – maximum number of clusters.
  • tol – convergence tolerance.
  • max_items – maximum queue length.
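
To make the phrase "as much influence as w data points" concrete, here is a minimal sketch for the mean of a 1-D Gaussian under a conjugate prior. It is not dpcluster code: posterior_mean and its arguments are hypothetical names, and the pseudo-count formula is the standard conjugate update, assumed here to reflect the role of w.

```python
def posterior_mean(prior_mean, w, data):
    """Posterior mean when the prior counts as `w` pseudo-observations."""
    n = len(data)
    sample_mean = sum(data) / n
    # Weighted average: the prior is weighted by w, the data by n.
    return (w * prior_mean + n * sample_mean) / (w + n)


data = [2.0, 2.0, 2.0, 2.0]
weak = posterior_mean(0.0, 0.1, data)    # prior barely moves the estimate
strong = posterior_mean(0.0, 4.0, data)  # prior counts as much as the 4 points
```

With w equal to the number of data points, the estimate lands halfway between the prior mean and the sample mean; with a small w such as the default 0.1, it stays close to the sample mean.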
get_model()[source]

Get current model.

Returns:instance of dpcluster.algorithms.VDP
put(r, s=0)[source]

Append data.

Parameters:r – sufficient statistics of data to be appended.

Basic usage example:

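>>> from dpcluster.algorithms import OnlineVDP
>>> from dpcluster.distributions import GaussianNIW
>>> distr = GaussianNIW(data.shape[2])  # data: array of observations
>>> x = distr.sufficient_stats(data)
>>> vdp = OnlineVDP(distr)
>>> vdp.put(x)
>>> print(vdp.get_model().cluster_parameters())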
class dpcluster.algorithms.Predictor(model, ix, iy)[source]
distr_fit(*args)
precomp(*args)
predict(*args)
predict_old(z, lgh=(True, True, False), full_var=False)[source]
class dpcluster.algorithms.PredictorKL(model, ix, iy)[source]
predict(*args)
predict_old(z, lgh=(True, True, False), full_var=False)[source]
class dpcluster.algorithms.VDP(distr, w=0.1, k=50, tol=1e-05, max_iters=10000)[source]

Bases: object

Variational Dirichlet Process clustering algorithm following “Variational Inference for Dirichlet Process Mixtures” by Blei and Jordan (2006).

Parameters:
  • distr – likelihood-prior distribution pair governing clusters. For now the only option is to use an instance of dpcluster.distributions.GaussianNIW.
  • w – non-negative prior weight. The prior has as much influence as w data points.
  • k – maximum number of clusters.
  • tol – convergence tolerance.
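
In the cited paper, the variational approximation truncates the stick-breaking construction of the Dirichlet process at a fixed number of components. The sketch below shows how such a truncation turns break fractions into mixture weights. It assumes that k plays the role of the truncation level in this class; stick_breaking_weights is a hypothetical helper, not part of dpcluster.

```python
def stick_breaking_weights(fractions):
    """Turn k-1 break fractions in [0, 1] into k mixture weights."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for v in fractions:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    # Truncation: the last cluster takes whatever is left of the stick.
    weights.append(remaining)
    return weights


weights = stick_breaking_weights([0.5, 0.5, 0.5])
```

The weights always sum to one, and clusters late in the ordering receive geometrically less mass unless their break fractions are large.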
batch_learn(x, verbose=False, sort=True)[source]

Learn clusters from data. This is a batch algorithm that requires all data to be loaded in memory.

Parameters:

Basic usage example:

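>>> from dpcluster.algorithms import VDP
>>> from dpcluster.distributions import GaussianNIW
>>> distr = GaussianNIW(data.shape[2])  # data: array of observations
>>> x = distr.sufficient_stats(data)
>>> vdp = VDP(distr)
>>> vdp.batch_learn(x)
>>> print(vdp.cluster_parameters())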
cluster_parameters()[source]
Returns:Cluster parameters.
cluster_sizes()[source]
Returns:Data weight assigned to each cluster.
conditional_expectation(*args)
conditional_ll(x, cond)[source]

Conditional log likelihood.

Parameters:
  • x – sufficient statistics of data.
  • cond – slice representing variables to condition on
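
As an illustration of conditioning on a slice of variables, the sketch below computes a conditional log likelihood for a single Gaussian as the joint log density minus the marginal log density of the conditioning variables. It is an assumption-laden simplification: it works on raw data points rather than sufficient statistics, uses one Gaussian instead of the trained mixture, and the function names are hypothetical.

```python
import numpy as np


def gaussian_ll(x, mu, Sigma):
    """Log density of a multivariate Gaussian at a single point x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


def conditional_ll(x, mu, Sigma, cond):
    """log p(x[rest] | x[cond]) = log p(x) - log p(x[cond])."""
    joint = gaussian_ll(x, mu, Sigma)
    marginal = gaussian_ll(x[cond], mu[cond], Sigma[cond, cond])
    return joint - marginal


mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = np.array([0.5, 1.5])
ll = conditional_ll(x, mu, Sigma, slice(0, 1))  # condition on the first coordinate
```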
conditional_variance(x, iy, ix, ret_ll_gr_hs=(True, False, False))[source]
ll(x, ret_ll_gr_hs=(True, False, False))[source]

Compute the log likelihoods (ll) of data with respect to the trained model.

Parameters:
  • x – sufficient statistics of the data.
  • ret_ll_gr_hs – what to return: likelihood, gradient, hessian. Derivatives taken with respect to data, not sufficient statistics.
marginal(*args)
plot_clusters(**kwargs)[source]

Asks each cluster to plot itself. For multidimensional Gaussian clusters, pass slc=np.array([i,j]) as an argument to project the clusters onto the plane defined by the i-th and j-th coordinates.

pseudo_resp(*args)
pseudo_resp_cache(*args)
resp(*args)
resp_cache(*args)
var_cond_exp(x, iy, ix, ret_ll_gr_hs=(True, False, False), full_var=False)[source]

distributions Module

class dpcluster.distributions.ConjugatePair(evidence_distr, prior_distr, prior_param)[source]

Conjugate prior-evidence pair of distributions in the exponential family. Conjugacy means that the posterior has the same form as the prior, with updated parameters.
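
A small example of conjugacy, using the Beta-Bernoulli pair rather than the Gaussian pair this package implements: the prior and the posterior are both Beta distributions, and observing data only changes the two parameters. The function name beta_bernoulli_update is hypothetical and not part of dpcluster.

```python
def beta_bernoulli_update(a, b, observations):
    """Return the Beta posterior parameters after a list of 0/1 observations."""
    successes = sum(observations)
    failures = len(observations) - successes
    # The posterior is again a Beta distribution: only the parameters change.
    return a + successes, b + failures


posterior = beta_bernoulli_update(1.0, 1.0, [1, 0, 1, 1])
```

Starting from a uniform Beta(1, 1) prior, three successes and one failure give a Beta(4, 2) posterior.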

Parameters:
posterior_ll(x, nu, ret_ll_gr_hs=(True, False, False), usual_x=False)[source]

Log likelihood (and derivatives) of data under posterior predictive distribution.

Parameters:
  • x – sufficient statistics of data
  • nu – prior parameters
sufficient_stats(data)[source]
sufficient_stats_dim()[source]
class dpcluster.distributions.ExponentialFamilyDistribution[source]

Models a distribution in the exponential family of the form:

\(f(x | \nu) = h(x) \exp( \nu \cdot T(x) - A(\nu) )\)

Parameters to be defined in subclasses:

  • h is the base measure
  • nu (\(\nu\)) are the parameters
  • T(x) are the sufficient statistics of the data
  • A is the log partition function
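
To show how the four ingredients combine, the sketch below writes a Bernoulli distribution in this form and evaluates its log density. The Bernoulli case is chosen only for brevity and is not a class in this package; all names in the block are hypothetical.

```python
import math


def expfam_log_density(x, nu, log_h, T, A):
    """log f(x | nu) = log h(x) + nu * T(x) - A(nu), scalar case."""
    return log_h(x) + nu * T(x) - A(nu)


def log_h(x):
    return 0.0  # base measure h(x) = 1


def T(x):
    return x  # sufficient statistic


def A(nu):
    return math.log(1.0 + math.exp(nu))  # log partition function


# Bernoulli with success probability p: the natural parameter is the log-odds.
p = 0.3
nu = math.log(p / (1.0 - p))

log_p1 = expfam_log_density(1, nu, log_h, T, A)  # equals log(p)
log_p0 = expfam_log_density(0, nu, log_h, T, A)  # equals log(1 - p)
```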
ll(xs, nus, ret_ll_gr_hs=(True, False, False))[source]

Log likelihood (and derivatives, optionally) of data under distribution.

Parameters:
  • xs – sufficient statistics of data
  • nus – parameters of distribution
log_base_measure(x, ret_ll_gr_hs=(True, False, False))[source]

Log of the base measure. To be implemented by subclasses.

Parameters:x – sufficient statistics of the data.
log_partition(nu, ret_ll_gr_hs=(True, False, False))[source]

Log of the partition function and derivatives with respect to sufficient statistics. To be implemented by subclasses.

Parameters:
  • nu – parameters of the distribution
  • ret_ll_gr_hs – what to return: log likelihood, gradient, hessian
class dpcluster.distributions.Gaussian(d)[source]

Bases: dpcluster.distributions.ExponentialFamilyDistribution

Multivariate Gaussian distribution with density:

\(f(x | \mu, \Sigma) = |2 \pi \Sigma|^{-1/2} \exp(-(x-\mu)^T \Sigma^{-1} (x - \mu)/2)\)

Natural parameters:

\(\nu = [\Sigma^{-1} \mu, -\Sigma^{-1}/2]\)

Sufficient statistics of data:

\(T(x) = [x, x \cdot x^T]\)

Parameters:d – dimension.
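
The following sketch checks numerically that the natural parameters and sufficient statistics above reproduce the Gaussian log density. It is not the library's implementation: the function names are hypothetical, and the split of constants between the base measure and the log partition function is one common choice, assumed here.

```python
import numpy as np


def usual_to_natural(mu, Sigma):
    """Return nu = [Sigma^-1 mu, -Sigma^-1 / 2] as a pair of arrays."""
    P = np.linalg.inv(Sigma)  # precision matrix
    return P @ mu, -0.5 * P


def sufficient_stats(x):
    """Return T(x) = [x, x x^T] as a pair of arrays."""
    return x, np.outer(x, x)


def log_density(x, mu, Sigma):
    """log f = log h(x) + nu . T(x) - A(nu), for a single point x."""
    d = len(mu)
    nu1, nu2 = usual_to_natural(mu, Sigma)
    t1, t2 = sufficient_stats(x)
    dot = nu1 @ t1 + np.sum(nu2 * t2)
    log_h = -0.5 * d * np.log(2 * np.pi)
    _, logdet = np.linalg.slogdet(Sigma)
    A = 0.5 * mu @ nu1 + 0.5 * logdet  # log partition, in usual parameters
    return log_h + dot - A


mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
x = np.array([0.2, 0.4])
value = log_density(x, mu, Sigma)
```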
log_base_measure(x, ret_ll_gr_hs=(True, True, True))[source]

Log base measure.

log_partition(nus)[source]
nat2usual(nus)[source]

Convert natural parameters to usual parameters.

sufficient_stats(x)[source]

Sufficient statistics of data.

Parameters:x – data

sufficient_stats_dim()[source]

Dimension of sufficient statistics.

usual2nat(mus, Sgs)[source]

Convert usual parameters to natural parameters.

class dpcluster.distributions.GaussianNIW(d)[source]

Bases: dpcluster.distributions.ConjugatePair

Gaussian, Normal-Inverse-Wishart conjugate pair.

The predictive posterior is a multivariate t-distribution.
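
For reference, the log density of a multivariate t-distribution can be evaluated as in the sketch below. The parameter names loc, scale and df are assumptions made for this illustration; how the library maps its prior parameters onto them is not stated here, and this is not its implementation.

```python
import math

import numpy as np


def multivariate_t_ll(x, loc, scale, df):
    """Log density of a multivariate Student t at a single point x."""
    d = len(loc)
    diff = x - loc
    _, logdet = np.linalg.slogdet(scale)
    maha = diff @ np.linalg.solve(scale, diff)
    return (math.lgamma(0.5 * (df + d)) - math.lgamma(0.5 * df)
            - 0.5 * d * math.log(df * math.pi) - 0.5 * logdet
            - 0.5 * (df + d) * math.log1p(maha / df))


# With one dimension and one degree of freedom this is the Cauchy density,
# whose value at the location is 1 / pi.
ll_cauchy = multivariate_t_ll(np.array([0.0]), np.array([0.0]), np.eye(1), 1.0)
```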

Parameters:d – dimension
conditional(*args)
conditional_expectation(*args)
conditional_variance(x, nu, iy, ix, ret_ll_gr_hs=(True, True, False), full_var=True)[source]
conditionals_cache(*args)
conditionals_cache_bare(*args)
marginal(nu, slc)[source]
plot(nu, szs, slc, n=100)[source]
posterior_ll(*args)
posterior_ll_cache(*args)
sufficient_stats(data)[source]
class dpcluster.distributions.NIW(d)[source]

Bases: dpcluster.distributions.ExponentialFamilyDistribution

Normal Inverse Wishart distribution defined by:

\(f(\mu,\Sigma|\mu_0,\Psi,k,\nu) = \text{Gaussian}(\mu|\mu_0,\Sigma/k) \cdot \text{Inverse-Wishart}(\Sigma|\Psi,\nu-d-2)\)

where \(\mu, \mu_0 \in R^d, \Sigma, \Psi \in R^{d \times d}, k, \nu \in R, \nu > 2d+1\)

This is an exponential family conjugate prior for the Gaussian.

Parameters:d – dimension
log_base_measure(x, ret_ll_gr_hs=(True, True, True))[source]
log_partition(nu, ret_ll_gr_hs=(True, False, False), no_k_grad=False)[source]
multipsi(a, d)[source]
nat2usual(*args)
sufficient_stats(mus, Sgs)[source]
sufficient_stats_dim()[source]
usual2nat(mu0, Psi, k, nu)[source]

test Module

class dpcluster.test.Tests(methodName='runTest')[source]

Bases: unittest.case.TestCase

gen_data(A, mu, n=10)[source]
setUp()[source]
test_batch_vdp()[source]
test_gaussian()[source]
test_gniw()[source]
test_gniw_conditionals()[source]
test_ll()[source]
test_niw()[source]
test_online_vdp(*args, **kwargs)[source]
test_predictor(*args, **kwargs)[source]
test_presp(*args, **kwargs)[source]
test_resp()[source]
test_vdp_conditionals()[source]
dpcluster.test.grad_check(f, x, eps=0.0001)[source]
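
The signature suggests a numerical gradient check, although its behavior is not documented here. Under that assumption, a central finite-difference estimate with step eps could be sketched as follows. The name numerical_gradient and the list-based interface are hypothetical and may differ from what grad_check actually does.

```python
def numerical_gradient(f, x, eps=1e-4):
    """Central-difference estimate of the gradient of f at the list x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def f(v):
    return v[0] ** 2 + 3.0 * v[0] * v[1]


estimate = numerical_gradient(f, [1.0, 2.0])
analytic = [2 * 1.0 + 3.0 * 2.0, 3.0 * 1.0]  # hand-derived gradient at (1, 2)
```

Comparing estimate with analytic within a small tolerance is the usual way to catch mistakes in hand-written derivative code.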