Table of Contents
Deep Generative Models
Family of Deep Generative Models (DGMs) to be covered -
- Generative Adversarial Networks (GANs)
- Variational Auto Encoders (VAEs)
- Denoising Diffusion Probabilistic Models (DDPMs)
- Score Based Models
- Auto Regressive Models (AR)
- Large Language Models (LLMs)
- State Space Models (SSMs)
- Example - S4, Mamba
- RL-based Alignment for LLMs
- RLHF, PPO, DPO
Any dataset $\mathcal{D} = \{x_1, x_2, \ldots, x_N\}$ consists of $N$ independent realizations of i.i.d. vector-valued random variables of size $d$, each distributed according to some unknown probability distribution $p_X$.
Goal - Given such a $\mathcal{D}$, the goal of using Generative Models is to estimate $p_X$ and learn to sample from it.
Principles of Generative Models -
- Assume a parametric family on $p_X$ denoted $p_\theta$, where $p_\theta$ is represented using Deep Neural Networks. This is our “model”.
- Define and estimate a divergence metric $D(p_X \| p_\theta)$ to measure the distance between $p_X$ and $p_\theta$.
- Solve an optimization problem over the parameters $\theta$ of $p_\theta$ to minimize the divergence metric.
Example
Assume some random variable $Z \sim p_Z$ with some arbitrary but known distribution (because the distribution is known, sampling is possible). Suppose there exists some function $g_\theta$ and let $X = g_\theta(Z)$.
- $X = g_\theta(Z)$ would have an entirely different distribution than that of $Z$, and that distribution would depend upon the function $g_\theta$.
- Suppose $g_\theta$ is a Deep Neural Network and the density of $g_\theta(Z)$ is denoted $p_\theta$. We can define a divergence metric $D(p_X \| p_\theta)$ between $p_X$ and $p_\theta$ such that $D(p_X \| p_\theta) \geq 0$ and $D(p_X \| p_\theta) = 0$ iff $p_X = p_\theta$.
- Solving the optimization problem $\min_\theta D(p_X \| p_\theta)$ would allow us to implicitly estimate $p_X$. This would allow us to sample from $p_X$ using $g_\theta$, because a random sample chosen from $p_Z$ and then passed through $g_\theta$ would be distributed very close to $p_X$.
This method is called a pushforward method, as we push a probability mass $p_Z$ into the data space using a function $g_\theta$.
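The pushforward idea can be sketched in a few lines of numpy. The affine map below is a hypothetical stand-in for the neural network $g_\theta$; a real generator would be a trained network, but the principle is identical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Base distribution p_Z: a standard normal (known, hence easy to sample).
z = rng.standard_normal(10_000)

# A toy pushforward map standing in for g_theta (a neural network in practice).
def g(z):
    return 2.0 * z + 5.0  # pushes N(0, 1) forward to N(5, 4)

x = g(z)  # samples from the pushed-forward distribution
print(round(x.mean(), 2), round(x.std(), 2))  # near 5.0 and 2.0
```

Sampling from the base distribution and applying the map yields samples from a different, induced distribution; in generative modelling the map is learnt so that this induced distribution matches $p_X$.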
Obstacles towards implementation -
- We only know random samples from these distributions (the dataset from $p_X$ and generated samples from $p_\theta$), not the distributions themselves. How do we compute the divergence metric without knowing the distributions $p_X$ and $p_\theta$?
- What should the choice of the divergence metric be?
- How to choose $g_\theta$ and in turn $p_\theta$?
- How to solve the optimization problem of minimizing the divergence metric?
Variational Divergence Minimization
Define a divergence between two distributions
f-divergence
Given two probability distribution functions with corresponding probability density functions denoted by $p$ and $q$, the f-divergence between them is -
$$D_f(p \,\|\, q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx$$
where $f$ is a convex, lower semi-continuous function with $f(1) = 0$.
- Convex function - A function that lies on or below the chord joining any two of its points; it has a single minimum value (which can be attained at multiple points).
- Strictly Convex function - A function which has only one global minimum.
- Lower Semi-Continuous function - A function whose value at a point is no greater than the limit of the function values approaching that point (the function does not jump downward in the limit).
- Probability density functions are always non-negative, thus their ratio $p(x)/q(x)$ is a non-negative value which satisfies the domain of $f$.
- The values $p(x)$ and $q(x)$ are non-negative scalars despite $x$ being a $d$-dimensional random vector.
The choice of the function $f$ is what leads to a particular $f$-divergence.
Properties of $f$-divergence -
- $D_f(p \,\|\, q) \geq 0$ for any choice of $p$ and $q$.
- $D_f(p \,\|\, q) = 0$ iff $p = q$.
Examples of $f$-divergence -
- $f(t) = t \log t$ leads to the KL (Kullback-Leibler) Divergence -
$$D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$
KL Divergence is asymmetric, meaning $D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p)$ in general.
- $f(t) = \frac{t}{2} \log \frac{2t}{1+t} + \frac{1}{2} \log \frac{2}{1+t}$ leads to the JS (Jensen-Shannon) Divergence.
- $f(t) = \frac{1}{2}\,|t - 1|$ leads to the Total Variation Distance or TV Distance.
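For discrete distributions the f-divergence integral becomes a sum, which makes these definitions easy to check numerically. A small sketch (the distributions `p` and `q` are made-up examples):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = f_divergence(p, q, lambda t: t * np.log(t))        # f(t) = t log t  -> KL
kl_rev = f_divergence(q, p, lambda t: t * np.log(t))    # swapped arguments
tv = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))  # f(t) = |t-1|/2  -> TV

print(round(kl, 4), round(kl_rev, 4), round(tv, 4))  # KL is asymmetric; TV = 0.1
```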
Algorithm for f-divergence minimization
We need to come up with an algorithm to optimize the $f$-divergence without knowing what the distributions $p_X$ and $p_\theta$ are, but instead by using the samples from $p_X$ and $p_\theta$ known to us (the dataset and the output of $g_\theta$).
Key Idea - Integrals involving density functions can be approximated using samples drawn from the distribution.
For an integral like the one shown below, we have $N$ i.i.d. samples $x_1, \ldots, x_N$ drawn from $p$ -
$$\int h(x)\, p(x)\, dx$$
By the Law of the Unconscious Statistician (LOTUS) we know that if $X$ is a random variable with a probability distribution $p$ and $h$ is some measurable function, then
$$\mathbb{E}[h(X)] = \int h(x)\, p(x)\, dx$$
By the Law of Large Numbers, we know that as the number of samples grows, the sample mean converges to the true expected value of a function $h$ -
$$\frac{1}{N} \sum_{i=1}^{N} h(x_i) \xrightarrow{N \to \infty} \mathbb{E}[h(X)]$$
So one way to handle an integral like the f-divergence is by using the above two laws and equating it to the expected value of a function. This is a mathematically valid representation, but it is not directly computable from the data, since the integrand involves the density ratio and the true data distribution $p_X$ is unknown.
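The two laws are easy to verify numerically. Below, the Monte Carlo estimate of $\mathbb{E}[h(X)]$ for $h(x) = x^2$ and $X \sim \mathcal{N}(0, 1)$ (whose exact value is 1) improves as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# E[h(X)] for X ~ N(0, 1) with h(x) = x^2 is exactly 1 (the second moment).
h = lambda x: x ** 2

estimates = {n: h(rng.standard_normal(n)).mean() for n in (100, 10_000, 1_000_000)}
for n, est in estimates.items():
    print(n, round(est, 3))  # the sample mean approaches 1 as n grows
```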
Conjugate of a convex function
The conjugate of a convex function $f$ is written as -
$$f^*(u) = \sup_{t \in \text{dom}(f)} \{ut - f(t)\}$$
At every point $u$, each choice of $t$ gives a lower bound $ut - f(t)$ on $f^*(u)$; one constructs these multiple lower bounds and chooses the tightest of them (supremum/max) as the value of the conjugate.
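The supremum can be checked numerically on a grid. For $f(t) = t \log t$ the conjugate has the closed form $f^*(u) = e^{u-1}$, which the brute-force grid search below recovers:

```python
import numpy as np

# Dense grid over dom(f) for a brute-force Fenchel conjugate.
t = np.linspace(1e-6, 50.0, 200_000)
f = t * np.log(t)  # f(t) = t log t (the KL-divergence generator)

def conjugate(u):
    """f*(u) = sup_t { u*t - f(t) }, approximated by the best grid point."""
    return float(np.max(u * t - f))

for u in (-1.0, 0.0, 1.0):
    print(round(conjugate(u), 4), round(np.exp(u - 1.0), 4))  # grid vs closed form
```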
Properties of a conjugate of a convex function -
- $f^*$ is also a convex function.
- For a convex, lower semi-continuous $f$, the biconjugate equals the function itself: $f^{**} = f$.
Using these properties of a conjugate of a convex function, we can write $f$ as -
$$f(t) = \sup_u \{tu - f^*(u)\}$$
Substituting this in the f-Divergence Integral we get -
$$D_f(p_X \,\|\, p_\theta) = \int p_\theta(x)\, \sup_u \left\{ \frac{p_X(x)}{p_\theta(x)}\, u - f^*(u) \right\} dx$$
To represent the $f$-divergence in terms of expectation, we need to somehow take the supremum out of the integral. Since the Fenchel conjugate expresses $f$ as a pointwise supremum, and since the inner maximand depends on $x$, the optimizer of the inner problem is generally a function $T(x)$ of $x$. Thus we can take the supremum out and rewrite the equation as -
$$D_f(p_X \,\|\, p_\theta) = \sup_{T \in \mathcal{T}} \; \mathbb{E}_{x \sim p_X}[T(x)] - \mathbb{E}_{x \sim p_\theta}[f^*(T(x))]$$
where $\mathcal{T}$ is the space of functions containing solutions for the inner optimization problem.
The space of functions $\mathcal{T}$ we are optimizing over may or may not contain the $T^*$ that is the solution to the inner optimization problem. This can occur either because $\mathcal{T}$ is a restricted function class (e.g., neural networks), or because the supremum defining the conjugate is not attained within $\text{dom}(f^*)$.
Because we are restricting $\mathcal{T}$, and using the fact that a supremum over a smaller class can only be smaller or equal, we can say -
$$D_f(p_X \,\|\, p_\theta) \geq \sup_{T \in \mathcal{T}} \; \mathbb{E}_{x \sim p_X}[T(x)] - \mathbb{E}_{x \sim p_\theta}[f^*(T(x))]$$
For the sake of understanding, we can replace the $\sup$ in the above equation with $\max$.
Realization of VDM (Variational Divergence Minimization)
We need a $p_\theta$ that minimizes the $f$-divergence $D_f(p_X \,\|\, p_\theta)$. $D_f$ is not directly optimizable due to $p_X$ being unknown. At the true optimum the lower bound on $D_f$ becomes a tight bound, thus making the optimization of $D_f$ and the optimization of the lower bound equivalent.
The final objective here requires two optimizations -
- Generator Network - The outer minimization w.r.t. the parameters $\theta$ of $p_\theta$.
- Critic/Discriminator Network - The inner maximization w.r.t. a class of functions $\mathcal{T}$.
With this the objective becomes -
$$\min_\theta \max_{T \in \mathcal{T}} \; \mathbb{E}_{x \sim p_X}[T(x)] - \mathbb{E}_{x \sim p_\theta}[f^*(T(x))]$$
Neural networks enter the framework when we need to perform the saddle-point optimization of the variational lower bound of the $f$-divergence, since both the generator distribution $p_\theta$ and the variational function $T$ are infinite-dimensional objects.
- The model distribution $p_\theta$ must be learnt to reduce the value of the $f$-divergence metric. Neural networks provide a flexible and differentiable parameterization for learning such complex data-generating distributions using the samples.
- The function class $\mathcal{T}$, being infinite-dimensional, is difficult to characterize explicitly. As neural networks are universal function approximators, we represent $T$ using neural networks $T_w$, where $w$ are the parameters of the neural network.
Implementing VDM for Generative Modelling

Now we have two neural networks, one representing the sampler $g_\theta$ (Generator Network) and the other representing the function class $T_w$ (Critic Network).
Such problems are called saddle-point optimization problems, as we are looking for saddle points, or adversarial optimization problems, because whatever one network is trying to do is the opposite of what the other network is trying to do.
Usually we avoid saddle points because they are neither a global minimum nor a global maximum, but this is one of the rare instances where we seek a saddle point.
By construction, $\text{dom}(f^*)$ is dependent upon the particular choice of $f$. $T_w$ can be rewritten as $T_w(x) = a_f(V_w(x))$ to guarantee that the critic's output lies in the domain of $f^*$ while keeping the network architecture independent of our choice of the $f$-divergence. Here -
- $V_w$ is a function (the raw network output) common across all $f$-divergences
- $a_f$ is an $f$-divergence specific activation function (need not always be the sigmoid function)

So we can rewrite our loss function as -
$$\min_\theta \max_w \; \mathbb{E}_{x \sim p_X}[a_f(V_w(x))] - \mathbb{E}_{x \sim p_\theta}[f^*(a_f(V_w(x)))]$$
Generative Adversarial Networks
For GANs the $f$-divergence that is chosen is as follows -
$$f(t) = t \log t - (t+1) \log(t+1)$$
whose conjugate is $f^*(u) = -\log(1 - e^u)$ with $\text{dom}(f^*) = (-\infty, 0)$ (up to an additive constant, the resulting divergence is the Jensen-Shannon divergence).
The log-sigmoid $a_f(v) = \log \sigma(v)$ as the activation ensures that the output of the critic always lies in $(-\infty, 0)$. Using the above information, and writing $D_w(x) = \sigma(V_w(x))$, we can write a GAN-specific loss function using the equation of the loss function we got earlier -
$$\min_\theta \max_w \; \mathbb{E}_{x \sim p_X}[\log D_w(x)] + \mathbb{E}_{x \sim p_\theta}[\log(1 - D_w(x))]$$
Observations to note -
- Due to the rearrangement of terms again satisfying the lower-bound equation, we can see that $D_f(p_X \,\|\, p_\theta) \geq \mathbb{E}_{x \sim p_X}[\log D_w(x)] + \mathbb{E}_{x \sim p_\theta}[\log(1 - D_w(x))]$.
- It can also be seen that this objective function strangely resembles the (negative) binary cross-entropy loss.
- The discriminator would try to maximize this objective function. Since the objective is large when real and fake samples are easily distinguishable, maximizing it encourages the discriminator to separate samples from $p_X$ and $p_\theta$ as effectively as possible. Thus it is called the “discriminator.”
- The generator would try to minimize this objective by only minimizing the second term of the objective function. In general this second term penalizes incorrect predictions made with a high probability. In this case it would be penalizing the fake samples the discriminator confidently distinguishes as fake. By minimizing it, the generator encourages the discriminator to assign higher probabilities to generated samples, which implicitly pushes the generator distribution $p_\theta$ towards the data distribution $p_X$.
- The above two observations are made more concrete in Formulation of classifier guided sampler.
- The discriminator is trying to assign a low probability to the generated samples while the generator pushes the discriminator to assign them a high probability. This causes the adversarial nature of the networks.
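The value of this objective directly reflects how well the discriminator separates the two sample sets, as a small numerical sketch shows (the score distributions below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def gan_value(d_real, d_fake):
    """E[log D(x)] + E[log(1 - D(x_fake))]: the negative binary cross-entropy."""
    return np.log(d_real).mean() + np.log(1 - d_fake).mean()

# A confident, correct discriminator: D(real) near 1, D(fake) near 0.
sharp = gan_value(sigmoid(rng.normal(4, 1, 10_000)), sigmoid(rng.normal(-4, 1, 10_000)))

# A discriminator that cannot tell the samples apart outputs D = 1/2 everywhere,
# giving the value 2 * log(1/2).
blind = gan_value(np.full(10_000, 0.5), np.full(10_000, 0.5))

print(round(sharp, 3), round(blind, 3))  # sharp is much closer to 0 than blind
```

The discriminator drives the objective upward (towards the `sharp` case); the generator drives it back down towards the indistinguishable `blind` case.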
The architecture ends up looking like the image below. $a_f$ doesn't need to be included in the discriminator network, as it is a part of the error function rather than of the networks.

Implementation of GAN in practice
Pre-requisites -
- Input dataset $\mathcal{D} = \{x_1, \ldots, x_N\}$.
- Let $\{x_{i_1}, \ldots, x_{i_m}\}$ (the indices need not be contiguous; the selection is usually random) be a mini-batch of size $m$ taken from the dataset.
- Let $\{z_1, \ldots, z_m\}$ be a batch of random noise vectors taken from the arbitrary distribution we chose, typically $\mathcal{N}(0, I)$.
As we are using these batches during Mini-Batch gradient descent, these batches are resampled each time before an update.
To train the Discriminator
Optimizing for $w$ -
Steps -
- Resample $\{x_{i_1}, \ldots, x_{i_m}\}$ and $\{z_1, \ldots, z_m\}$.
- Pass $\{x_{i_1}, \ldots, x_{i_m}\}$ through the discriminator to get $D_w(x_{i_j})$. Then using these calculate the first term of the loss function -
$$\frac{1}{m} \sum_{j=1}^{m} \log D_w(x_{i_j})$$
- Keeping $\theta$ constant, pass the random noise vectors through the generator to generate fake samples $\hat{x}_j = g_\theta(z_j)$. Pass these through the discriminator to get $D_w(\hat{x}_j)$. Using these calculate the second term of the loss function -
$$\frac{1}{m} \sum_{j=1}^{m} \log(1 - D_w(\hat{x}_j))$$
- Calculate the gradient and backpropagate the gradients back through the discriminator network.
- Update $w$ using gradient “ascent”.

To Train the Generator
Optimizing for $\theta$ -
Steps -
- Resample $\{z_1, \ldots, z_m\}$.
- Keeping $w$ constant, pass the random noise vectors through the generator to generate fake samples $\hat{x}_j = g_\theta(z_j)$. Pass these through the discriminator to get $D_w(\hat{x}_j)$. Using these calculate the term of the loss function -
$$\frac{1}{m} \sum_{j=1}^{m} \log(1 - D_w(\hat{x}_j))$$
- Calculate the gradient and backpropagate the gradients back through the discriminator network and then through the generator network. The discriminator's parameters are not updated, as $w$ is held constant.
- Update $\theta$ using gradient descent.
We have to solve these optimization problems alternately. First update the generator parameters while keeping the discriminator parameters constant, then update the discriminator parameters while keeping the generator parameters constant.
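The full alternating loop can be sketched end-to-end in numpy on a 1-D toy problem: a generator $g_\theta(z) = az + b$ tries to turn $\mathcal{N}(0,1)$ noise into $\mathcal{N}(4,1)$ data against a logistic discriminator, with all gradients written out by hand. One deviation from the minimax objective above: the generator step uses the common non-saturating heuristic (ascend on $\log D(g_\theta(z))$ rather than descend on $\log(1 - D(g_\theta(z)))$), which avoids the vanishing gradients a confident discriminator causes early in training.

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda s: 1.0 / (1.0 + np.exp(-s))

sample_data = lambda n: rng.normal(4.0, 1.0, n)   # "real" data: N(4, 1)
sample_noise = lambda n: rng.standard_normal(n)   # base noise: N(0, 1)

a, b = 1.0, 0.0   # generator g(z) = a*z + b
w, c = 0.1, 0.0   # discriminator D(x) = sigmoid(w*x + c)
lr, batch = 0.05, 128

for step in range(3000):
    # Discriminator step: gradient ASCENT on log D(x) + log(1 - D(g(z))).
    xr, z = sample_data(batch), sample_noise(batch)
    xf = a * z + b
    dr, df = sig(w * xr + c), sig(w * xf + c)
    w += lr * ((1 - dr) * xr - df * xf).mean()
    c += lr * ((1 - dr) - df).mean()

    # Generator step (w, c frozen): ascent on the non-saturating log D(g(z)).
    z = sample_noise(batch)
    df = sig(w * (a * z + b) + c)
    a += lr * ((1 - df) * w * z).mean()
    b += lr * ((1 - df) * w).mean()

print(round(a, 2), round(b, 2))  # b drifts toward the data mean 4
```

Even on this toy problem the variance parameter $a$ is learnt far less reliably than the mean $b$, a small-scale glimpse of the training instabilities discussed later in these notes.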
Interpretation of GANs as Classifier guided Generative Samplers
In GANs the discriminator network acts as a binary classifier which is able to predict whether a given sample belongs to $p_X$ or $p_\theta$.
Can this classifier be used to bring $p_X$ and $p_\theta$ closer?
- Yes! If $p_X$ and $p_\theta$ were sufficiently close to each other, even an optimal discriminator would fail to distinguish the samples between them. So we can tweak $\theta$ till the classifier fails to distinguish between samples of $p_X$ and $p_\theta$.
- However, the failure of some weak or arbitrary classifier doesn't imply that $p_X$ and $p_\theta$ are close to each other. So instead of just tweaking $\theta$, we also need to change the binary classifier $D$. If all such different variations of the classifiers fail to distinguish the samples, then we can confidently say that $p_X$ and $p_\theta$ are close to each other.

Mode Collapse Problem -
- During the alternating updates of the generator and the classifier, it can happen that $g_\theta$ and the classifier keep oscillating between two configurations and get stuck in an infinite loop. This is called the mode collapse problem.
Formulation of classifier guided sampler
Creating the classifier -
- Let $D(x)$ represent the probability of the sample $x$ coming from $p_X$.
- We'd need to maximize the log-likelihood of $x$ coming from $p_X$ when $x$ is a real sample (probability of real samples being marked as real) and also maximize the log-likelihood of $x$ not coming from $p_X$ when $x$ is a fake sample (probability of fake samples being marked as fake). So the combined objective of the classifier becomes maximizing the sum of these two values -
$$\max_D \; \mathbb{E}_{x \sim p_X}[\log D(x)] + \mathbb{E}_{x \sim p_\theta}[\log(1 - D(x))]$$
Similarly for the generator -
- The objective for the generator is that the classifier has to “fail” in distinguishing samples from $p_X$ and $p_\theta$. So the goal here is to minimize the discriminator's confidence that fake samples coming from $p_\theta$ are fake -
$$\min_\theta \; \mathbb{E}_{x \sim p_\theta}[\log(1 - D(x))]$$
Thus the overall objective becomes -
$$\min_\theta \max_D \; \mathbb{E}_{x \sim p_X}[\log D(x)] + \mathbb{E}_{x \sim p_\theta}[\log(1 - D(x))]$$
Deep Convolutional GAN (DCGAN)
(intentionally left incomplete for now)
Conditional GAN (cGAN)

To make conditional GANs that sample from a conditional distribution, we pass the conditioning variable $y$ through both the generator and the discriminator.
- Instead of just the noise $z$ we pass $(z, y)$, so the generator learns $g_\theta(z, y)$.
- Instead of just seeing $x$, the discriminator sees $(x, y)$. It now answers “Is $x$ a real sample consistent with condition $y$?”.
The objective function changes to -
$$\min_\theta \max_w \; \mathbb{E}_{(x, y)}[\log D_w(x, y)] + \mathbb{E}_{z, y}[\log(1 - D_w(g_\theta(z, y), y))]$$
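Mechanically, the conditioning is often implemented by simple concatenation; a shape-level sketch (all sizes here are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.standard_normal((4, 8))     # batch of 4 noise vectors, dim 8
y = np.eye(3)[[0, 2, 1, 0]]         # one-hot class labels, 3 classes

gen_in = np.concatenate([z, y], axis=1)   # generator sees (z, y)

x_fake = rng.standard_normal((4, 16))     # stand-in for g_theta(z, y) outputs
disc_in = np.concatenate([x_fake, y], axis=1)  # discriminator sees (x, y)

print(gen_in.shape, disc_in.shape)  # (4, 11) (4, 19)
```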
Inference with GANs/VDM

Suppose $g_{\theta^*}$ is the optimal generator network achieved via training. For any test input $z \sim p_Z$ and class label $y$, the output $g_{\theta^*}(z, y)$ would be a sample corresponding to the class label specified by $y$.
Improvisations and Applications of GANs
Wasserstein GANs (WGANs)
What makes GAN training unstable?
Manifold Hypothesis -
- Images in the real world lie on a lower-dimensional manifold of the ambient space.
- Consider all images such that a pixel is 1 if a coin toss results in a heads, else 0. The probability of an image generated in such a manner resembling an English alphabet letter is very low. So we can say that the images depicting English alphabet letters lie on a low-dimensional manifold of the ambient space.
- Similarly, an image of any natural object lies on a lower-dimensional manifold of the image's ambient space $\mathbb{R}^d$.
The real distribution $p_X$ and the generated distribution $p_\theta$ are both distributions over $\mathbb{R}^d$. Since the real data and the generated data lie on lower-dimensional manifolds of $\mathbb{R}^d$, the supports (the set on which the corresponding density function has a non-zero value) of $p_X$ and $p_\theta$ misalign with a very high probability.
When the supports of $p_X$ and $p_\theta$ misalign, it can be shown that a perfect discriminator always exists. In such a case the gradient of the discriminator becomes 0, due to which the discriminator parameters and the generator parameters can't be updated any further (gradients flow through the discriminator into the generator). Thus GAN training saturates.
Remedies for GAN training saturation -
- Train the generator and the discriminator at different training ratios, usually training the generator more.
- Instead of using the $f$-divergence metric, which becomes independent of the generator parameters when GAN training saturates, use a “softer” metric which does not saturate when the supports of $p_X$ and $p_\theta$ misalign.
Wasserstein Metric (Optimal Transport)
Given two distributions $p_X$ and $p_\theta$ -
$$W(p_X, p_\theta) = \inf_{\gamma \in \Pi(p_X, p_\theta)} \; \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]$$
Here the minimizer $\gamma^*$ corresponds to a “transport plan” that requires the least amount of “work done” in redistributing $p_\theta$ to be similar to $p_X$.
Derivation (Not Rigorous) -

Imagine $p_\theta$ here to be piles of dirt and $p_X$ to be a pile of dirt shaped like a bell curve. The minimum transport plan just tells you the best plan to shovel dirt between the two piles such that the final pile resembles $p_X$.
But how does shoveling dirt relate to redistributing distributions? Every joint distribution between the two distributions $p_X$ and $p_\theta$ can be written as a table whose row sums recover one marginal and whose column sums recover the other. For example, with marginals $(0.5, 0.5)$ and $(0.4, 0.3, 0.3)$ -
| 0.3 | 0.2 | 0.0 |
| 0.1 | 0.1 | 0.3 |
This joint distribution between the two marginals $p_X$ and $p_\theta$ is a transport plan which tells how one of the marginals can be transformed into the other. To quantify the “effort” required in a transport plan, we can use the concept of work done ($W = \text{force} \times \text{displacement}$) from Physics. In our context, moving probability mass $\gamma(x, y)$ across a distance $\|x - y\|$ costs $\gamma(x, y)\, \|x - y\|$.
The average work done required in a transport plan $\gamma$ is -
$$\mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]$$
Let $\Pi(p_X, p_\theta)$ be the family of all joint distributions/transport plans between the marginals $p_X$ and $p_\theta$. We are interested in finding the transport plan which requires the least amount of work done -
$$W(p_X, p_\theta) = \inf_{\gamma \in \Pi(p_X, p_\theta)} \; \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]$$
This is the Wasserstein Metric.
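In one dimension the infimum over transport plans has a convenient solution: the optimal plan matches sorted order statistics, so $W_1$ between two equal-size empirical samples reduces to an average of sorted differences. A numpy check on two Gaussians whose true $W_1$ distance equals their mean difference:

```python
import numpy as np

rng = np.random.default_rng(0)

def w1_empirical(x, y):
    """W1 between equal-size 1-D samples: monotone (sorted) matching is optimal."""
    return float(np.abs(np.sort(x) - np.sort(y)).mean())

# N(0,1) vs N(3,1): same shape, shifted mean, so W1 = 3 exactly.
x = rng.normal(0.0, 1.0, 100_000)
y = rng.normal(3.0, 1.0, 100_000)
w = w1_empirical(x, y)
print(round(w, 2))  # close to 3
```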
How to minimize the Wasserstein Metric
The Wasserstein metric is a minimization problem. Every minimization problem has a dual maximization problem. One such result of min-max duality, the Kantorovich-Rubinstein duality, states -
$$W(p_X, p_\theta) = \sup_{\|T\|_L \leq 1} \; \mathbb{E}_{x \sim p_X}[T(x)] - \mathbb{E}_{x \sim p_\theta}[T(x)]$$
Any function $T$ being 1-Lipschitz means that the function cannot change faster than the distance -
$$|T(x_1) - T(x_2)| \leq \|x_1 - x_2\| \quad \forall\, x_1, x_2$$
The $T$ in this case is a neural network $T_w$ and can be kept (approximately) Lipschitz by clipping the weights of $T_w$ into a fixed range $[-c, c]$ after each gradient step.
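Weight clipping, the original WGAN recipe for the Lipschitz constraint, is a one-liner applied after every optimizer step (the parameter arrays below are hypothetical):

```python
import numpy as np

# Clip every critic parameter into [-c, c] after each gradient step so the
# network's overall slope stays bounded (the original WGAN heuristic).
c = 0.01

def clip_weights(params, c=c):
    return [np.clip(p, -c, c) for p in params]

# Hypothetical critic parameters after a gradient step.
params = [np.array([[0.5, -0.002], [0.03, -1.0]]), np.array([2.0, -0.005])]
params = clip_weights(params)
print([p.tolist() for p in params])
```

Note that clipping enforces some Lipschitz constant rather than exactly 1, which is why the constraint above is described as approximate.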
$p_\theta$ has to be chosen such that the Wasserstein distance is minimized. The Kantorovich-Rubinstein duality enables us to express the Wasserstein distance in terms of expectations over $p_X$ and $p_\theta$ -
$$\min_\theta \max_{w \,:\, T_w \text{ is 1-Lipschitz}} \; \mathbb{E}_{x \sim p_X}[T_w(x)] - \mathbb{E}_{x \sim p_\theta}[T_w(x)]$$
The above objective is very similar to a GAN's. That's why this method of minimizing the Wasserstein metric is called the WGAN. Training a WGAN is more stable than training a naive GAN.
Bi-Directional GAN (Bi-GAN)
Inversion of GANs
We train a GAN specifically to allow us to sample from the dataset distribution $p_X$ by picking a random sample $z$ from an arbitrary distribution $p_Z$ and passing it through the generator $g_\theta$. But how can we get back $z$ if we know $x = g_\theta(z)$?
Inversion is useful for -
- Feature Extraction - Knowing any sample $x$ and the inversion of our GAN can allow us to obtain GAN-inverted vectors $z$ and use them as features for the data.
- Data Manipulation/Editing - Suppose a GAN is trained on images. If we wish to edit an image using the GAN itself, we need to know the corresponding latent vector (input vector $z$) for that image and edit this vector in such a way that the image corresponding to this edited input is our desired output image.
Bi-Directional GAN
In Bi-GAN, in addition to the generator and the discriminator networks there's also an Encoder/Inverter network denoted as $E_\phi$.
Here the discriminator won't just take $x$ and $g_\theta(z)$ as the inputs to distinguish between, but will take a pair of values $(x, E_\phi(x))$ and $(g_\theta(z), z)$ as the input and attempt to distinguish the joint pairs. If the discriminator fails to do so, we can say that $E_\phi$ inverts $g_\theta$.
The objective function is -
$$\min_{\theta, \phi} \max_w \; \mathbb{E}_{x \sim p_X}[\log D_w(x, E_\phi(x))] + \mathbb{E}_{z \sim p_Z}[\log(1 - D_w(g_\theta(z), z))]$$
Here $g_\theta$ and $E_\phi$ are trained simultaneously using the discriminator. Once trained, $g_\theta$ acts as the generator and $E_\phi$ is used for inversion. It can be shown that, at the optimum, $E_\phi = g_\theta^{-1}$.
Domain Adversarial Networks
Suppose we have a source dataset $\mathcal{D}_S$ and a target dataset $\mathcal{D}_T$ such that the two belong to different distributions.
Any classifier/regressor trained solely on $\mathcal{D}_S$ would fail to predict for the target items in $\mathcal{D}_T$. We can use Domain Adversarial Networks here to train a classifier that is domain agnostic (able to classify independent of which domain an element belongs to).
In Domain Adversarial Networks we have -
- An Encoder to extract features from inputs regardless of which domain the inputs belong to (both $\mathcal{D}_S$ and $\mathcal{D}_T$).
- A Discriminator to distinguish between elements of $\mathcal{D}_S$ and elements of $\mathcal{D}_T$ (features of both source and target data).
- A Classifier/Regressor which uses the features of the source inputs to make a prediction regarding them.
Here the Discriminator makes the Encoder better at constructing features from the inputs (both source and target) in such a way that the features appear domain agnostic. But just having domain-agnostic features isn't enough; they need to be useful for predicting the target class. For this we include a Classifier/Regressor as well in the network, so that the features learnt are both domain agnostic and useful.
Evaluation of a GAN
Suppose we have some true and generated samples and we wish to evaluate whether the GAN is successful in generating samples from $p_X$. There are various methods for this, but we'd be looking at a method of evaluation called the Fréchet Inception Distance (FID). FID uses the Wasserstein metric along with an Inception network trained on ImageNet to do this evaluation.
Steps -
- Take a pretrained Inception network trained on the ImageNet dataset. Let this be denoted as $\Phi$.
- Pass the real samples $x \sim p_X$ and the generated samples $\hat{x} \sim p_\theta$ through $\Phi$ and extract the features for both from some intermediate layer of $\Phi$.
- Compute the mean and covariance $(\mu_X, \Sigma_X)$ and $(\mu_\theta, \Sigma_\theta)$ for the real and generated features.
- Assume that the features of the true and generated data come from Gaussian distributions with the corresponding means and covariances.
- Calculate the Wasserstein-2 metric between $\mathcal{N}(\mu_X, \Sigma_X)$ and $\mathcal{N}(\mu_\theta, \Sigma_\theta)$, which has a closed form for Gaussians -
$$\text{FID} = \|\mu_X - \mu_\theta\|^2 + \text{Tr}\!\left(\Sigma_X + \Sigma_\theta - 2\left(\Sigma_X \Sigma_\theta\right)^{1/2}\right)$$
This is the FID, our evaluation metric of the GAN.
$p_X$ and $p_\theta$ would never be exactly equal because of the heavy amount of approximations made for VDM. The trained GAN is better if the FID is lower; a lower FID means a smaller distance between $p_X$ and $p_\theta$.
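Since both feature distributions are modelled as Gaussians, the Wasserstein-2 distance in the FID has a closed form that is straightforward to compute with numpy (this sketch assumes the covariances are symmetric PSD, which sample covariances are):

```python
import numpy as np

def psd_sqrt(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(mu1, cov1, mu2, cov2):
    """||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2)) between two Gaussians."""
    s1 = psd_sqrt(cov1)
    # Tr((C1 C2)^(1/2)) = Tr((C1^(1/2) C2 C1^(1/2))^(1/2)) for PSD C1, C2.
    cross = psd_sqrt(s1 @ cov2 @ s1)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross))

mu, cov = np.zeros(2), np.eye(2)
print(fid(mu, cov, mu, cov))        # identical Gaussians: distance 0
print(fid(mu, cov, mu + 1.0, cov))  # mean shifted by (1, 1): distance 2
```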