Latent variable models assume that observed data is generated from hidden (unobserved) variables, and learning means inferring those hidden variables and their relationship to the data.

$$
P_{θ} (x) = \sum_{z} P_{θ} (x, z) \quad \text{or} \quad P_{θ} (x) = \int_{z} P_{θ} (x, z) \, dz
$$

Here $z$ is a latent (unobserved, hidden) random variable. Typically $z$ is estimated jointly with the model parameters $θ$. For each $x_{i} \in D$ we assume that there exists a corresponding $z_{i}$.

If $z$ is discrete, $z \in \{z_{1}, z_{2}, \dots, z_{m}\}$. Because every $x_{i} \in D$ has a corresponding $z_{i}$, each $x_{i}$ falls into one of $m$ buckets. This is essentially what K-Means Clustering and Gaussian Mixture Models aim to do.
If $z$ is continuous, $x \in \mathbb{R}^{d}, z \in \mathbb{R}^{k}$, typically with $k \ll d$. Here $z_{i} \mid x_{i}$ represents a feature vector corresponding to the given $x_{i}$.

Principle for Learning LVMs

Suppose we have a dataset $D = \{x_{i}\}_{i = 1}^{n}$ drawn from a data distribution $P_{X}$, and a Latent Variable Model $P_{θ} (x) = \int_{Z} P_{θ} (x, z) \, dz$. Our goal is to estimate the model parameters $θ$ given $D$. This can be done by minimizing the KL divergence between $P_{X}$ and $P_{θ}$.

$$
\begin{aligned}
θ^{*} &= \arg\min_{θ} D_{KL} (P_{X} \,\|\, P_{θ}) \\
&= \arg\min_{θ} \left[ \int_{X} P_{X} (x) \log \left( \frac{P_{X} (x)}{P_{θ} (x)} \right) dx \right] \\
&= \arg\min_{θ} \left[ \underbrace{\int_{X} P_{X} (x) \log (P_{X} (x)) \, dx}_{\text{independent of } θ} - \int_{X} P_{X} (x) \log (P_{θ} (x)) \, dx \right] \\
&= \arg\min_{θ} \left[ - \int_{X} P_{X} (x) \log (P_{θ} (x)) \, dx \right] \\
&= \arg\max_{θ} \left[ \int_{X} P_{X} (x) \log (P_{θ} (x)) \, dx \right] \\
&= \arg\max_{θ} E_{P_{X}} [\log (P_{θ} (x))]
\end{aligned}
$$
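In practice $P_{X}$ is only available through samples, so the final expectation is usually approximated by an average over the dataset. The sketch below illustrates this for a toy model that is an assumption made for this example only: a 1-D Gaussian with fixed unit variance and no latent variable, whose single parameter is the mean. The function names and the grid search are illustrative choices, not part of the notes.

```python
import math
import random

def gaussian_log_pdf(x, mu, sigma=1.0):
    """Log-density of a 1-D Gaussian N(x; mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def avg_log_likelihood(data, mu):
    """Sample average of log P_theta(x): a Monte Carlo estimate of E_{P_X}[log P_theta(x)]."""
    return sum(gaussian_log_pdf(x, mu) for x in data) / len(data)

random.seed(0)
# Samples from an assumed "true" distribution P_X = N(2, 1).
data = [random.gauss(2.0, 1.0) for _ in range(1000)]

# Crude grid search over theta = mu in [0, 4] with step 0.01.
grid = [i / 100 for i in range(401)]
mu_hat = max(grid, key=lambda mu: avg_log_likelihood(data, mu))
sample_mean = sum(data) / len(data)

print(f"grid-search MLE: {mu_hat:.2f}, sample mean: {sample_mean:.4f}")
```

For this toy model the maximizer of the average log-likelihood is the sample mean, so the grid-search result should agree with it up to the grid resolution.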

Here $\log (P_{θ} (x))$ is the log-likelihood of $x$ under the model $P_{θ}$, so this optimization problem is called Maximum Likelihood Estimation. Instead of solving the entire optimization problem at once, let's consider the log-likelihood of a single data point $x$,

$$
\begin{aligned}
l (θ) = \log P_{θ} (x) &= \log \int_{Z} P_{θ} (x, z) \, dz \\
&= \log \int_{Z} P_{θ} (x, z) \frac{q (z \mid x)}{q (z \mid x)} \, dz \\
&= \log \int_{Z} q (z \mid x) \frac{P_{θ} (x, z)}{q (z \mid x)} \, dz \\
&= \log E_{q (z \mid x)} \left[ \frac{P_{θ} (x, z)}{q (z \mid x)} \right]
\end{aligned}
$$

By Jensen's Inequality, because $\log$ is concave, we know that $\log E [f (x)] \geq E [\log f (x)]$. Applying this to the above equation for $l (θ)$ we get,

$$
l (θ) = \log E_{q (z \mid x)} \left[ \frac{P_{θ} (x, z)}{q (z \mid x)} \right] \geq E_{q (z \mid x)} \left[ \log \frac{P_{θ} (x, z)}{q (z \mid x)} \right] \quad (\text{denoted as } J_{θ} (q))
$$

Here $J_{θ} (q)$ is called the Evidence Lower Bound (ELBO). It is a function of both the model parameters $θ$ and the density on $z$, $q (z \mid x)$, which is called the Variational Latent Posterior. As in VDM, we maximize a lower bound on a quantity in order to optimize the quantity itself.
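The bound can be checked numerically when $z$ is discrete, since the expectation becomes a finite sum. The sketch below uses made-up joint values $P_{θ} (x, z = j)$ for a single fixed $x$ and three latent states (the numbers and the helper name `elbo` are assumptions for illustration), and compares the ELBO for a few choices of $q$ against $\log P_{θ} (x)$.

```python
import math

def elbo(joint, q):
    """J(q) = sum_j q_j * log(P(x, z=j) / q_j) for one fixed x and a discrete z."""
    return sum(qj * math.log(pj / qj) for pj, qj in zip(joint, q) if qj > 0)

# Hypothetical values of P_theta(x, z=j) for one fixed x and j = 0, 1, 2.
joint = [0.10, 0.05, 0.25]

log_px = math.log(sum(joint))                # log P_theta(x) = log sum_z P_theta(x, z)
posterior = [p / sum(joint) for p in joint]  # P_theta(z | x) by Bayes' rule

uniform_q = [1 / 3, 1 / 3, 1 / 3]
skewed_q = [0.7, 0.2, 0.1]

print("log P(x)         :", log_px)
print("ELBO, uniform q  :", elbo(joint, uniform_q))
print("ELBO, skewed q   :", elbo(joint, skewed_q))
print("ELBO, q=posterior:", elbo(joint, posterior))
```

Every choice of $q$ gives a value no larger than $\log P_{θ} (x)$, and the bound is tight when $q (z \mid x) = P_{θ} (z \mid x)$. The gap is the KL divergence from $q (z \mid x)$ to $P_{θ} (z \mid x)$, which is a standard identity rather than something derived in these notes.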

$$
θ^{*}, q^{*} = \arg\max_{θ, q} J_{θ} (q), \quad \text{where } J_{θ} (q) = E_{q (z \mid x)} \left[ \log \frac{P_{θ} (x, z)}{q (z \mid x)} \right]
$$

Gaussian Mixture Models (GMM)

In GMMs $z$ is discrete, $z \in \{1, 2, \dots, M\}$.

$$
P_{θ} (x) = \sum_{z} P_{θ} (x, z) = \sum_{j = 1}^{M} P_{θ} (z = j) \, P_{θ} (x \mid z = j)
$$

In a GMM, $P_{θ} (z = j) = α_{j}$ , $P_{θ} (x ∣ z = j) = N (x; μ_{j}, Σ_{j})$ .

$$
P_{θ} (x) = \sum_{j = 1}^{M} P_{θ} (z = j) \, P_{θ} (x \mid z = j) = \sum_{j = 1}^{M} α_{j} \cdot N (x; μ_{j}, Σ_{j})
$$
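As a concrete illustration of the mixture density, the sketch below evaluates it for a 1-D GMM with two components. The 1-D restriction (scalar variances instead of covariance matrices), the parameter values, and the function names are all assumptions made for this example. A simple Riemann sum is used as a sanity check that the weighted sum of Gaussians is still a valid density.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, alphas, mus, variances):
    """P_theta(x) = sum_j alpha_j * N(x; mu_j, var_j)."""
    return sum(a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances))

# Hypothetical parameters theta for a two-component mixture.
alphas = [0.3, 0.7]      # mixing weights, non-negative and summing to 1
mus = [-2.0, 3.0]        # component means
variances = [1.0, 0.5]   # component variances

# Riemann sum over [-12, 12]; should be close to 1 for a valid density.
step = 0.01
total = sum(gmm_pdf(-12.0 + i * step, alphas, mus, variances) * step for i in range(2400))

print("P(x=0)  :", gmm_pdf(0.0, alphas, mus, variances))
print("integral:", total)
```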

Parameters of a GMM are,

$$
θ = \{α_{1}, α_{2}, \dots, α_{M}, μ_{1}, μ_{2}, \dots, μ_{M}, Σ_{1}, Σ_{2}, \dots, Σ_{M}\}
$$

Here $x \in \mathbb{R}^{d}$, $μ_{j} \in \mathbb{R}^{d}$, $Σ_{j} \in \mathbb{R}^{d \times d}$, and $0 \leq α_{j} \leq 1$ with $\sum_{j = 1}^{M} α_{j} = 1$. Since our goal is to estimate $θ$ and $q$ via ELBO optimization, we can use the Expectation Maximization Algorithm (EM Algorithm), which updates $q$ and $θ$ alternately.

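$$
\begin{aligned}
&\text{For } t = 1 \text{ to } T: \\
&\quad q_{t + 1}^{*} = \arg\max_{q} J_{θ_{t}} (q) && (\text{with } θ_{t} \text{ as constant}) \\
&\quad θ_{t + 1}^{*} = \arg\max_{θ} J_{θ} (q_{t + 1}) && (\text{with } q_{t + 1} \text{ as constant})
\end{aligned}
$$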

It can be shown that EM ensures that $l (θ_{t + 1}) \geq l (θ_{t})$. This does not guarantee that the likelihood strictly increases at every step, only that it never decreases as the parameters are updated.

Applying the EM algorithm to the GMM, it can be shown analytically (try it once) that,

$$
\begin{aligned}
q_{t + 1}^{*} &= \arg\max_{q} J_{θ} (q) = P_{θ} (z \mid x) = \frac{P_{θ} (x \mid z) \cdot P_{θ} (z)}{P_{θ} (x)} \\
&= \frac{N (x; μ_{j}, Σ_{j}) \cdot α_{j}}{\sum_{k = 1}^{M} N (x; μ_{k}, Σ_{k}) \cdot α_{k}} && \left( P_{θ} (x) = \sum_{j = 1}^{M} P_{θ} (z_{j}) \, P_{θ} (x \mid z_{j}) \right) \\
θ_{t + 1}^{*} &= \arg\max_{θ} J_{θ} (q) && (\text{Exercise to show})
\end{aligned}
$$

ELBO can be optimized for an LVM using the EM algorithm, if $P_{θ} (z ∣ x)$ can be computed.
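The alternating updates can be sketched for a 1-D GMM as follows. The E-step sets $q$ to the posterior $P_{θ} (z \mid x)$ using Bayes' rule. The M-step uses the standard closed-form updates for the weights, means, and variances, which the notes leave as an exercise, so treat them here as an assumed result rather than a derivation. The synthetic data, the initial values, the 1-D restriction, and all function names are illustrative assumptions.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, alphas, mus, variances):
    """Sum over the data of log P_theta(x_i) under the mixture."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)))
        for x in data
    )

def em_step(data, alphas, mus, variances):
    """One EM iteration: E-step (posterior over z), then M-step (closed-form updates)."""
    # E-step: q(z=j | x_i) = alpha_j N(x_i; mu_j, var_j) / sum_k alpha_k N(x_i; mu_k, var_k)
    resp = []
    for x in data:
        w = [a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)]
        total = sum(w)
        resp.append([wj / total for wj in w])

    # M-step: weighted counts, weighted means, weighted variances.
    n = len(data)
    new_alphas, new_mus, new_vars = [], [], []
    for j in range(len(alphas)):
        nj = sum(r[j] for r in resp)
        mu = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nj
        new_alphas.append(nj / n)
        new_mus.append(mu)
        new_vars.append(max(var, 1e-6))  # numerical guard, not part of the EM derivation
    return new_alphas, new_mus, new_vars

random.seed(0)
# Synthetic data from an assumed two-component mixture.
data = [random.gauss(-2.0, 1.0) if random.random() < 0.3 else random.gauss(3.0, 0.7)
        for _ in range(300)]

alphas, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]  # arbitrary initial theta
history = [log_likelihood(data, alphas, mus, variances)]
for _ in range(30):
    alphas, mus, variances = em_step(data, alphas, mus, variances)
    history.append(log_likelihood(data, alphas, mus, variances))

print("weights:", alphas)
print("means  :", mus)
print("log-likelihood: first", history[0], "last", history[-1])
```

The recorded log-likelihood values in `history` should never decrease from one iteration to the next, which is the monotonicity property of EM stated earlier.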

Variational Auto-Encoders (VAE)
