dpcluster Package¶

`algorithms` Module¶

class dpcluster.algorithms.OnlineVDP(distr, w=0.1, k=25, tol=0.001, max_items=100)[source]¶

Experimental online clustering algorithm.

Parameters:
distr – likelihood-prior distribution pair governing clusters. For now the only option is an instance of `dpcluster.distributions.GaussianNIW`.
w – non-negative prior weight. The prior has as much influence as w data points.
k – maximum number of clusters.
tol – convergence tolerance.
max_items – maximum queue length.
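One way to read the prior weight w is as a count of pseudo-observations. The sketch below is not dpcluster code: it uses a hypothetical helper, `blend_mean`, to show how a prior mean carrying weight w combines with the mean of n data points in a conjugate Gaussian model (mean only).

```python
def blend_mean(prior_mean, w, data):
    """Posterior mean when the prior counts as `w` pseudo-observations.

    Hypothetical helper for illustration; not part of dpcluster.
    """
    n = len(data)
    if w + n == 0:
        raise ValueError("need a positive prior weight or at least one data point")
    return (w * prior_mean + sum(data)) / (w + n)


data = [2.0, 2.0, 2.0, 2.0]
weak = blend_mean(0.0, 0.1, data)    # small weight: the data dominates
strong = blend_mean(0.0, 4.0, data)  # prior counts as much as the 4 points
print(weak, strong)
```

With w equal to the number of data points, the result sits halfway between the prior mean and the sample mean, which is the sense in which the prior "has as much influence as w data points".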

get_model()[source]¶

Get current model.

Returns:	instance of `dpcluster.algorithms.VDP`

put(r, s=0)[source]¶

Append data.

Parameters:	r – sufficient statistics of data to be appended.

Basic usage example:

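>>> from dpcluster.distributions import GaussianNIW
>>> from dpcluster.algorithms import OnlineVDP
>>> distr = GaussianNIW(data.shape[1])  # assumes data has shape (n, d)
>>> x = distr.sufficient_stats(data)
>>> vdp = OnlineVDP(distr)
>>> vdp.put(x)
>>> print(vdp.get_model().cluster_parameters())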

class dpcluster.algorithms.Predictor(model, ix, iy)[source]¶

distr_fit(*args)¶

precomp(*args)¶

predict(*args)¶

predict_old(z, lgh=(True, True, False), full_var=False)[source]¶

class dpcluster.algorithms.PredictorKL(model, ix, iy)[source]¶

predict(*args)¶

predict_old(z, lgh=(True, True, False), full_var=False)[source]¶

class dpcluster.algorithms.VDP(distr, w=0.1, k=50, tol=1e-05, max_iters=10000)[source]¶

Bases: object

Variational Dirichlet Process clustering algorithm following “Variational Inference for Dirichlet Process Mixtures” by Blei and Jordan (2006).
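The variational approach of the cited paper works with a truncated stick-breaking representation of the Dirichlet process: mixture weights are \(\pi_i = v_i \prod_{j<i} (1 - v_j)\), and the last stick fraction is fixed to 1 so that the weights sum to one. The sketch below only illustrates that construction. The function name is hypothetical, and how dpcluster stores these quantities internally is not shown in this documentation.

```python
def stick_breaking_weights(fractions):
    """Turn stick fractions v_1..v_{k-1} into k mixture weights.

    The final fraction is implicitly 1 (truncation), so the weights sum to 1.
    Hypothetical helper for illustration; not part of dpcluster.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for v in fractions:
        if not 0.0 <= v <= 1.0:
            raise ValueError("stick fractions must lie in [0, 1]")
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)  # last cluster takes what is left
    return weights


weights = stick_breaking_weights([0.5, 0.5, 0.5])
print(weights)
```

Here the number of weights plays the role of the truncation level, which corresponds to the maximum number of clusters k.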

Parameters:
distr – likelihood-prior distribution pair governing clusters. For now the only option is an instance of `dpcluster.distributions.GaussianNIW`.
w – non-negative prior weight. The prior has as much influence as w data points.
k – maximum number of clusters.
tol – convergence tolerance.

batch_learn(x, verbose=False, sort=True)[source]¶

Learn clusters from data. This is a batch algorithm that requires all data to be loaded in memory.

Parameters:
x – sufficient statistics of the data to be clustered. Can be obtained from raw data by calling `dpcluster.distributions.ConjugatePair.sufficient_stats()`.
verbose – print progress report.
sort – algorithm optimization. Sort clusters at every step.

Basic usage example:

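>>> from dpcluster.distributions import GaussianNIW
>>> from dpcluster.algorithms import VDP
>>> distr = GaussianNIW(data.shape[1])  # assumes data has shape (n, d)
>>> x = distr.sufficient_stats(data)
>>> vdp = VDP(distr)
>>> vdp.batch_learn(x)
>>> print(vdp.cluster_parameters())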

cluster_parameters()[source]¶

Returns:	Cluster parameters.

cluster_sizes()[source]¶

Returns:	Data weight assigned to each cluster.
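A common way to obtain a per-cluster data weight in mixture models is to sum soft assignments (responsibilities) over the data points. The following sketch assumes that interpretation. It is not taken from the dpcluster source, and `responsibilities` and `sizes_from_resp` are hypothetical names.

```python
import math


def responsibilities(log_scores):
    """Normalize one row of per-cluster log scores into probabilities."""
    top = max(log_scores)  # subtract the max for numerical stability
    unnorm = [math.exp(s - top) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sizes_from_resp(resp_rows):
    """Sum responsibilities column-wise: one weight per cluster."""
    return [sum(col) for col in zip(*resp_rows)]


# Three data points, two clusters; entries are log(weight * likelihood).
log_scores = [[0.0, 0.0], [math.log(3.0), 0.0], [0.0, math.log(3.0)]]
resp = [responsibilities(row) for row in log_scores]
sizes = sizes_from_resp(resp)
print(sizes)
```

Under this reading the sizes need not be integers, and they add up to the total number of data points.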

conditional_expectation(*args)¶

conditional_ll(x, cond)[source]¶

Conditional log likelihood.

Parameters:
x – sufficient statistics of data.
cond – slice representing the variables to condition on.
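For a single Gaussian cluster, conditioning on a subset of variables has a closed form. The sketch below works through the two-variable case to show what "conditioning on" a coordinate means. It illustrates the general idea only and is not dpcluster's implementation; `gaussian_conditional` is a hypothetical helper.

```python
def gaussian_conditional(mu, cov, x_value):
    """Condition a 2-D Gaussian over (x, y) on x = x_value.

    mu is [mu_x, mu_y]; cov is [[s_xx, s_xy], [s_xy, s_yy]].
    Returns the mean and variance of y given x.
    """
    (s_xx, s_xy), (_, s_yy) = cov
    gain = s_xy / s_xx  # regression coefficient of y on x
    cond_mean = mu[1] + gain * (x_value - mu[0])
    cond_var = s_yy - gain * s_xy
    return cond_mean, cond_var


mean, var = gaussian_conditional([0.0, 1.0], [[2.0, 1.0], [1.0, 2.0]], 2.0)
print(mean, var)
```

Observing x shifts the mean of y along the regression line and never increases its variance.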

conditional_variance(x, iy, ix, ret_ll_gr_hs=(True, False, False))[source]¶

ll(x, ret_ll_gr_hs=(True, False, False))[source]¶

Compute the log likelihoods (ll) of data with respect to the trained model.

Parameters:
x – sufficient statistics of the data.
ret_ll_gr_hs – what to return: likelihood, gradient, hessian. Derivatives are taken with respect to the data, not the sufficient statistics.

marginal(*args)¶

plot_clusters(**kwargs)[source]¶

Asks each cluster to plot itself. For multidimensional Gaussian clusters, pass slc=np.array([i,j]) as an argument to project the clusters onto the plane defined by the i-th and j-th coordinates.
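Projecting a Gaussian cluster onto two coordinates amounts to keeping the matching entries of its mean and covariance. A minimal sketch in plain Python, with a hypothetical `project_gaussian` helper (dpcluster itself takes a NumPy index array through `slc`):

```python
def project_gaussian(mean, cov, idx):
    """Marginal of a Gaussian over the coordinates listed in `idx`."""
    sub_mean = [mean[i] for i in idx]
    sub_cov = [[cov[i][j] for j in idx] for i in idx]
    return sub_mean, sub_cov


mean = [1.0, 2.0, 3.0]
cov = [[4.0, 0.1, 0.2],
       [0.1, 5.0, 0.3],
       [0.2, 0.3, 6.0]]
# Keep coordinates 0 and 2, dropping coordinate 1.
sub_mean, sub_cov = project_gaussian(mean, cov, [0, 2])
print(sub_mean, sub_cov)
```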

pseudo_resp(*args)¶

pseudo_resp_cache(*args)¶

resp(*args)¶

resp_cache(*args)¶

var_cond_exp(x, iy, ix, ret_ll_gr_hs=(True, False, False), full_var=False)[source]¶

`distributions` Module¶

class dpcluster.distributions.ConjugatePair(evidence_distr, prior_distr, prior_param)[source]¶

Conjugate prior-evidence pair of distributions in the exponential family. Conjugacy means that the posterior has the same form as the prior, with updated parameters.

Parameters:
evidence_distr – Evidence distribution. Must be an instance of `ExponentialFamilyDistribution`.
prior_distr – Prior distribution. Must be an instance of `ExponentialFamilyDistribution`.
prior_param – Prior parameters.
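Conjugacy is easiest to see on a small case. The sketch below uses the Beta-Bernoulli pair rather than the Gaussian-NIW pair that dpcluster provides, because its update fits in a few lines: the posterior is again a Beta distribution whose parameters are the prior parameters plus counts from the data. The function name is hypothetical.

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Posterior Beta(alpha', beta') after observing 0/1 outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


# Beta(1, 1) is the uniform prior on the success probability.
posterior = beta_bernoulli_update(1.0, 1.0, [1, 0, 1, 1])
posterior_mean = posterior[0] / (posterior[0] + posterior[1])
print(posterior, posterior_mean)
```

The same pattern, prior parameters plus summed sufficient statistics, is what makes exponential-family conjugate pairs convenient for clustering.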

posterior_ll(x, nu, ret_ll_gr_hs=(True, False, False), usual_x=False)[source]¶

Log likelihood (and derivatives) of data under posterior predictive distribution.

Parameters:
x – sufficient statistics of data.
nu – prior parameters.

sufficient_stats(data)[source]¶

sufficient_stats_dim()[source]¶

class dpcluster.distributions.ExponentialFamilyDistribution[source]¶

Models a distribution in the exponential family of the form:

\(f(x | \nu) = h(x) \exp( \nu \cdot T(x) - A(\nu) )\)

Parameters to be defined in subclasses:

h is the base measure
nu (\(\nu\)) are the parameters
T(x) are the sufficient statistics of the data
A is the log partition function

ll(xs, nus, ret_ll_gr_hs=(True, False, False))[source]¶

Log likelihood (and derivatives, optionally) of data under distribution.

Parameters:
xs – sufficient statistics of data.
nus – parameters of distribution.

log_base_measure(x, ret_ll_gr_hs=(True, False, False))[source]¶

Log of the base measure. To be implemented by subclasses.

Parameters:	x – sufficient statistics of the data.

log_partition(nu, ret_ll_gr_hs=(True, False, False))[source]¶

Log of the partition function and its derivatives with respect to the parameters. To be implemented by subclasses.

Parameters:
nu – parameters of the distribution.
ret_ll_gr_hs – what to return: log likelihood, gradient, hessian.

class dpcluster.distributions.Gaussian(d)[source]¶

Bases: dpcluster.distributions.ExponentialFamilyDistribution

Multivariate Gaussian distribution with density:

\(f(x | \mu, \Sigma) = |2 \pi \Sigma|^{-1/2} \exp(-(x-\mu)^T \Sigma^{-1} (x - \mu)/2)\)

Natural parameters:

\(\nu = [\Sigma^{-1} \mu, -\Sigma^{-1}/2]\)

Sufficient statistics of data:

\(T(x) = [x, x \cdot x^T]\)

Parameters:	d – dimension.
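The natural-parameter form can be checked numerically in one dimension. The sketch below assumes the base measure \(h(x) = (2\pi)^{-1/2}\), which is one common convention; where dpcluster places this constant is not stated here. All function names are hypothetical.

```python
import math


def usual_to_natural(mu, sigma2):
    """1-D version of nu = [Sigma^{-1} mu, -Sigma^{-1} / 2]."""
    return [mu / sigma2, -0.5 / sigma2]


def log_partition(nu):
    """A(nu) for the 1-D Gaussian with base measure h(x) = (2 pi)^(-1/2)."""
    return -nu[0] ** 2 / (4.0 * nu[1]) - 0.5 * math.log(-2.0 * nu[1])


def exp_family_density(x, nu):
    """h(x) * exp(nu . T(x) - A(nu)) with T(x) = [x, x^2]."""
    t = [x, x * x]
    dot = nu[0] * t[0] + nu[1] * t[1]
    return math.exp(dot - log_partition(nu)) / math.sqrt(2.0 * math.pi)


def normal_density(x, mu, sigma2):
    """Textbook form of the same density, for comparison."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


nu = usual_to_natural(1.0, 4.0)
print(exp_family_density(0.5, nu), normal_density(0.5, 1.0, 4.0))
```

Both expressions evaluate the same density, so the two printed numbers agree up to floating-point rounding.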

log_base_measure(x, ret_ll_gr_hs=(True, True, True))[source]¶

Log base measure.

log_partition(nus)[source]¶

nat2usual(nus)[source]¶

Convert natural parameters to usual parameters.

sufficient_stats(x)[source]¶

Sufficient statistics of data.

Parameters:	x – data

sufficient_stats_dim()[source]¶

Dimension of sufficient statistics.

usual2nat(mus, Sgs)[source]¶

Convert usual parameters to natural parameters.

class dpcluster.distributions.GaussianNIW(d)[source]¶

Bases: dpcluster.distributions.ConjugatePair

Gaussian, Normal-Inverse-Wishart conjugate pair.

The posterior predictive distribution is a multivariate t-distribution.
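The practical consequence of a t-distributed predictive is heavier tails than a Gaussian with the same location and scale. The sketch below evaluates a one-dimensional Student-t log density. The degrees of freedom and scale that dpcluster derives from the NIW parameters are not documented here, so the values used are arbitrary, and the function names are hypothetical.

```python
import math


def student_t_logpdf(x, df, loc=0.0, scale=1.0):
    """Log density of a 1-D Student-t distribution."""
    z = (x - loc) / scale
    return (math.lgamma(0.5 * (df + 1.0)) - math.lgamma(0.5 * df)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - 0.5 * (df + 1.0) * math.log1p(z * z / df))


def normal_logpdf(x, loc=0.0, scale=1.0):
    """Log density of a 1-D Gaussian, for comparison."""
    z = (x - loc) / scale
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(scale)


# Far from the center the t density is much larger than the Gaussian one.
print(student_t_logpdf(5.0, df=3.0), normal_logpdf(5.0))
```

As the degrees of freedom grow, the t log density approaches the Gaussian one, so the heavy tails matter most when a cluster has seen little data.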

Parameters:	d – dimension

conditional(*args)¶

conditional_expectation(*args)¶

conditional_variance(x, nu, iy, ix, ret_ll_gr_hs=(True, True, False), full_var=True)[source]¶

conditionals_cache(*args)¶

conditionals_cache_bare(*args)¶

marginal(nu, slc)[source]¶

plot(nu, szs, slc, n=100)[source]¶

posterior_ll(*args)¶

posterior_ll_cache(*args)¶

sufficient_stats(data)[source]¶

class dpcluster.distributions.NIW(d)[source]¶

Bases: dpcluster.distributions.ExponentialFamilyDistribution

Normal Inverse Wishart distribution defined by:

\(f(\mu,\Sigma|\mu_0,\Psi,k,\nu) = \text{Gaussian}(\mu|\mu_0,\Sigma/k) \cdot \text{Inverse-Wishart}(\Sigma|\Psi,\nu-d-2)\)

where \(\mu, \mu_0 \in R^d\), \(\Sigma, \Psi \in R^{d \times d}\), \(k \in R\), and \(\nu \in R\) with \(\nu > 2d+1\).

This is an exponential family conjugate prior for the Gaussian.

Parameters:	d – dimension

log_base_measure(x, ret_ll_gr_hs=(True, True, True))[source]¶

log_partition(nu, ret_ll_gr_hs=(True, False, False), no_k_grad=False)[source]¶

multipsi(a, d)[source]¶

nat2usual(*args)¶

sufficient_stats(mus, Sgs)[source]¶

sufficient_stats_dim()[source]¶

usual2nat(mu0, Psi, k, nu)[source]¶

`test` Module¶

class dpcluster.test.Tests(methodName='runTest')[source]¶

Bases: unittest.case.TestCase

gen_data(A, mu, n=10)[source]¶

setUp()[source]¶

test_batch_vdp()[source]¶

test_gaussian()[source]¶

test_gniw()[source]¶

test_gniw_conditionals()[source]¶

test_ll()[source]¶

test_niw()[source]¶

test_online_vdp(*args, **kwargs)[source]¶

test_predictor(*args, **kwargs)[source]¶

test_presp(*args, **kwargs)[source]¶

test_resp()[source]¶

test_vdp_conditionals()[source]¶

dpcluster.test.grad_check(f, x, eps=0.0001)[source]¶
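The signature suggests a finite-difference gradient check. What `dpcluster.test.grad_check` actually computes and returns is not documented here, so the following is only a sketch of the general technique, with a hypothetical `numeric_gradient` helper that uses central differences and the same default step size.

```python
def numeric_gradient(f, x, eps=1e-4):
    """Central-difference estimate of the gradient of f at the list x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def f(x):
    """Example function with a known gradient."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def f_grad(x):
    """Analytic gradient of f, to be compared with the estimate."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


point = [1.0, 2.0]
estimate = numeric_gradient(f, point)
max_error = max(abs(a - b) for a, b in zip(estimate, f_grad(point)))
print(max_error)
```

A check of this kind is a natural companion to the `ret_ll_gr_hs` flags, since it compares an analytic gradient against a numerical estimate.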
