CS236: Deep Generative Models
Last tended to on May 19, 2022
Introduction
Feynman: “What I cannot create, I do not understand”
Generative modeling: “What I understand, I can create”
How to generate natural images with a computer?
Generation: High level description => raw sensory outputs
Inference: raw sensory outputs => high level description
Statistical Generative Models
are learned from data. (Priors are necessary, but not as strict as in Graphics)
Data = samples
Priors = parametric (e.g. Gaussian prior), loss function, optimization algo, etc.
The result is a probability distribution p(x) over images x.
Sampling from p(x) generates new images.
Discriminative vs. generative
Discriminative model: the input x is always given; model the conditional p(y | x).
Generative model: the input x is not given; model the joint p(x, y) (or just p(x)), which requires modeling x itself.
Conditional generative models
They blur the line between generative and discriminative, because they also condition on some input features.
Superresolution: p(high-res signal | low-res signal)
Inpainting: p(full image | mask)
Colorization: p(color image | greyscale)
Translation: p(English text | Chinese text)
Text-to-Image: p(image | caption)
Background
What is a generative model?
We are given a dataset of examples, e.g. images of cats.
Generation: Then, if we sample x_new ~ p(x), it should look like a cat.
Density estimation: p(x) should be high for cat-like images and low otherwise (useful for anomaly detection).
Unsupervised: We should be able to learn what cats have in common, e.g. ears, tail, etc. (features!)
Structure through independence
Consider an input x with several components x1, …, xn. The simplest structural assumption is full independence: p(x) = p(x1)p(x2)⋯p(xn), which needs far fewer parameters.
However, this assumption is too strong – oftentimes, components are highly correlated (like pixels in an image.)
Chain rule – fully general, no assumption on the joint. But the conditionals toward the end condition on more and more variables and become intractable; way too many parameters.
Need a better simplifying assumption in the middle…
Conditional independence (Markov) assumption: assume each variable is independent of everything earlier given its immediate predecessor.
Actually this is just a special case of Bayes network, where it’s like a line of nodes
x1 => x2 => x3 => … => xn-1 => xn.
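Writing out the two factorizations being compared (standard results, with rough parameter counts for binary variables):
$$
\text{Chain rule: } p(x_1, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \ldots, x_{n-1}) \quad (\text{exponentially many parameters})
$$
$$
\text{Markov assumption: } p(x_1, \ldots, x_n) = p(x_1) \prod_{i=1}^{n-1} p(x_{i+1} \mid x_i) \quad (O(n) \text{ parameters})
$$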
Bayes Network / graphical models:
This is a directed acyclic graph with one node per random variable, and one conditional probability distribution per node.
Each random variable depends directly only on its parents in the graph.
This implies conditional independences: each variable is independent of its non-descendants given its parents.
Use neural networks to represent the conditional distributions.
Naive Bayes:
Assume that all the inputs are independent conditioned on y. (another special case of a Bayes net)
Directly estimate the conditionals p(xi|y) from data => use those + Bayes' rule to compute p(y|x).
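Concretely, the (standard) Naive Bayes classification rule:
$$
p(y \mid x_1, \ldots, x_n) = \frac{p(y) \prod_{i=1}^{n} p(x_i \mid y)}{\sum_{y'} p(y') \prod_{i=1}^{n} p(x_i \mid y')}
$$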
Discriminative vs. generative
p(y,x) = p(x|y)p(y) = p(y|x)p(x)
Generative: need to learn/specify both p(y), p(x|y)
Discriminative: just need to learn p(y|x) (X is always given)
Discriminative assumes that p(y|x;a) = f(x;a) (assumes that the probability distribution takes a certain functional form.)
E.g. logistic regression. Modeling p(y|x) as a linear combination of the inputs => squeeze with softmax. Decision boundaries are straight lines (assumption of logistic regression.) Logistic does not assume conditional independence like Naive Bayes does.
Using a conditional model is only possible when X is always observed. When some X_i are unobserved, the generative model allows us to compute p(Y | X_evidence) by marginalizing over the unseen variables.
Autoregressive Models
Bayes net with modeling assumptions:
- model using chain rule (fully general)
- assume the conditionals take a fixed functional form (e.g., a logistic regression)
Sampling from autoregressive models is often slow, since generation is sequential (one variable at a time), even with fast architectures like CNNs or transformers.
Fully Visible Sigmoid Belief Network (FVSBN)
\(p\left(X_{i}=1 \mid x_{<i} ; \boldsymbol{\alpha}^{i}\right)=\sigma\left(\alpha_{0}^{i}+\sum_{j<i} \alpha_{j}^{i} x_{j}\right)\)
Neural Autoregressive Density Estimation (NADE)
simple: model as Bernoulli
more classes: model as Categorical
RNADE: continuous- model as mixture of Gaussians
Like FVSBN, but use a 1-hidden-layer neural net:
\(\mathrm{h}_{i}=\sigma\left(A_{i} \mathrm{x}_{<i}+\mathrm{c}_{i}\right), \quad \hat{x}_{i}=p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)=\sigma\left(\boldsymbol{\alpha}_{i} \mathrm{h}_{i}+b_{i}\right)\)
Problem: lots of redundant parameters. Solution: “tie” the weights to reduce the number of parameters and speed up computation (see blue dots in the figure):
$$
\mathrm{h}_{i}=\sigma\left(W_{\cdot,<i}\, \mathrm{x}_{<i}+\mathrm{c}\right), \quad \hat{x}_{i}=p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)=\sigma\left(\boldsymbol{\alpha}_{i} \mathrm{h}_{i}+b_{i}\right)
$$
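A minimal sketch of a binary NADE log-likelihood computation (my own toy code, not from the course; shapes and variable names are assumptions):

```python
import torch

def nade_log_prob(x, W, c, V, b):
    """Toy binary NADE log-likelihood.

    x: (n,) binary input
    W: (H, n) tied input-to-hidden weights, c: (H,) hidden bias
    V: (n, H) hidden-to-output weights, b: (n,) output bias
    """
    n = x.shape[0]
    log_prob = 0.0
    a = c.clone()                                # running pre-activation; uses only x_{<i}
    for i in range(n):
        h = torch.sigmoid(a)                     # h_i = sigma(W_{.,<i} x_{<i} + c)
        p_i = torch.sigmoid(V[i] @ h + b[i])     # p(x_i = 1 | x_{<i})
        log_prob += x[i] * torch.log(p_i) + (1 - x[i]) * torch.log(1 - p_i)
        a = a + W[:, i] * x[i]                   # O(H) update thanks to weight tying
    return log_prob

# usage (hypothetical shapes)
n, H = 8, 16
x = torch.randint(0, 2, (n,)).float()
W, c = torch.randn(H, n), torch.randn(H)
V, b = torch.randn(n, H), torch.randn(n)
print(nade_log_prob(x, W, c, V, b))
```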
RNADE
\(\mathrm{h}_{i}=\sigma\left(W_{\cdot,<i}\, \mathrm{x}_{<i}+\mathrm{c}\right), \quad p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)=\sum_{j=1}^{K} \frac{1}{K}\, \mathcal{N}\left(x_{i} ; \mu_{i}^{j}, \sigma_{i}^{j}\right)\), where the means \(\mu_i^j\) and standard deviations \(\sigma_i^j\) are computed from \(\mathrm{h}_i\) by the network (uniform mixture weights in the simplest case).
Autoregressive Autoencoder: Masked Autoencoder for Distribution Estimation (MADE)
Use masks to disallow certain paths in an autoencoder to make it autoregressive.
Solution: use masks to disallow certain paths (Germain et al., 2015). Suppose an ordering of the inputs is chosen. (A rough mask-construction sketch follows below.)
- The unit producing the parameters for the first variable in the ordering is not allowed to depend on any input. The unit for the second variable depends only on the first. And so on…
- For each unit in a hidden layer, pick a random integer i in [1, n-1]. That unit is allowed to depend only on the first i inputs (according to the chosen ordering).
- Add masks to preserve this invariant: connect a unit to all units in the previous layer with a smaller or equal assigned number (strictly smaller in the final layer).
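A rough sketch of how MADE-style masks can be built following the recipe above (my own toy code; layer sizes and names are assumptions, and the natural ordering 1..n is used):

```python
import numpy as np

def made_masks(n_inputs, hidden_sizes, rng=np.random.default_rng(0)):
    """Build binary masks for a MADE-style MLP (natural ordering assumed)."""
    # degree of each input unit = its position in the ordering (1..n)
    degrees = [np.arange(1, n_inputs + 1)]
    # each hidden unit gets a random integer m in [1, n-1]:
    # it may depend only on the first m inputs
    for h in hidden_sizes:
        degrees.append(rng.integers(1, n_inputs, size=h))
    # output unit for x_i may depend only on inputs strictly before i
    degrees.append(np.arange(1, n_inputs + 1))

    masks = []
    for l in range(len(degrees) - 1):
        d_in, d_out = degrees[l], degrees[l + 1]
        if l == len(degrees) - 2:   # final layer: strictly smaller degree
            masks.append((d_out[:, None] > d_in[None, :]).astype(np.float32))
        else:                       # hidden layers: smaller or equal degree
            masks.append((d_out[:, None] >= d_in[None, :]).astype(np.float32))
    return masks   # multiply each weight matrix elementwise by its mask

masks = made_masks(n_inputs=4, hidden_sizes=[8, 8])
print([m.shape for m in masks])   # [(8, 4), (8, 8), (4, 8)]
```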
RNNs
Challenge: the history x_{<i} for these autoregressive models keeps getting longer and longer. Ideally we’d just keep a fixed-size “summary” of the history (the RNN hidden state).
Transformers
Masked self-attention preserves the autoregressive structure: each position can only attend to earlier positions.
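A minimal sketch of the causal mask used in masked self-attention (toy single-head code, not the course’s implementation):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """q, k, v: (T, d). Position t may only attend to positions <= t."""
    T, d = q.shape
    scores = q @ k.T / d ** 0.5                        # (T, T) attention logits
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))  # block attention to the future
    return F.softmax(scores, dim=-1) @ v               # (T, d)

T, d = 5, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = causal_self_attention(q, k, v)   # out[t] depends only on positions <= t
```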
PixelCNN
Masked convolutions preserve the raster-scan ordering: a pixel only depends on pixels above it and to its left.
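A sketch of the kernel mask for a PixelCNN-style convolution in raster-scan order (single-channel toy version, my own code; the real model also handles channel ordering and mask types more carefully):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv whose kernel is zeroed at/after the center pixel (type 'A' mask)."""
    def __init__(self, *args, mask_type="A", **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0  # center row: right of center (and center for 'A')
        mask[kH // 2 + 1:, :] = 0                         # all rows below center
        self.register_buffer("mask", mask)

    def forward(self, x):
        # apply the mask to the weights so each output pixel sees only "earlier" pixels
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding)

conv = MaskedConv2d(1, 16, kernel_size=5, padding=2)
out = conv(torch.randn(1, 1, 28, 28))
```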
Lol, you can use these for adversarial attacks / defenses (e.g. checking the model likelihood of suspicious inputs).
WaveNet
Autoregressive model of raw audio samples; dilated causal convolutions give a very large receptive field.
Learning a generative model (Maximum Likelihood)
We are given a dataset D of samples drawn from some (unknown) data distribution.
We are given a family of models M; our task is to learn a “good” model Mhat in M that defines a distribution p_Mhat close to the data distribution.
Can’t capture the exact distribution. All we have are samples – that’s very sparse coverage over the space of all possible samples. (So…we need regularization / priors / inductive biases.)
KL divergence: a “distance” between two distributions,
$$
D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]
$$
However, it’s not quite a “distance,” because it’s asymmetric.
Intuition: we have a known distribution p (the data) and a candidate q (the model); D_KL(p||q) measures the price of using q instead of p (see the compression detour below).
Detour: compression
Generative models are basically compression schemes. Trying to compress the data as well as we can.
To compress, it is useful to know the probability distribution the data is sampled from
For example, let X1, · · · , X100 be samples of an unbiased coin. Roughly 50 heads and 50 tails. Optimal compression scheme is to record heads as 0 and tails as 1. In expectation, use 1 bit per sample, and cannot do better
Suppose the coin is biased, and P[H] ≫ P[T]. Then it’s more efficient to use fewer bits on average to represent heads and more bits to represent tails, e.g.
Batch multiple samples together
Use a short sequence of bits to encode HHHH (common) and a long sequence for TTTT (rare).
Like Morse code: E = •, A = •−, Q = − − •−
KL-divergence: if your data comes from p, but you use a scheme optimized for q, the divergence DKL(p||q) is the number of extra bits you’ll need on average
Minimizing KL divergence is equivalent to maximizing the expected log likelihood. => Maximum Likelihood Estimation
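Writing out that equivalence (standard derivation):
$$
D_{KL}(p_{\text{data}} \,\|\, p_\theta)
= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_\theta(x)}\right]
= \underbrace{\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{data}}(x)\right]}_{\text{constant w.r.t. } \theta}
- \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]
$$
so minimizing the KL over \(\theta\) is the same as maximizing \(\mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)] \approx \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \log p_\theta(x)\).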
MLE Learning: Stochastic Gradient Descent
Empirical risk minimization can easily overfit the data.
Bias-variance trade off:
Bias limitation: If the hypothesis space of functions is very limited, we might not be able to represent the data distribution.
Variance limitation: If the hypothesis space is too expressive, it will overfit to the data.
How to prevent overfitting? Prefer “simpler” models (Occam’s razor.) Regularization in the objective function. Evaluate on validation set while training.
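Going back to the MLE-by-SGD training itself, a generic sketch of the loop (toy code; `model.log_prob`, the dataloader, and the hyperparameters are placeholders, not from the course):

```python
import torch

def train_mle(model, dataloader, epochs=10, lr=1e-3):
    """Minimize the average negative log-likelihood over minibatches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x in dataloader:
            loss = -model.log_prob(x).mean()   # NLL = -(1/m) sum_i log p_theta(x_i)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # in practice: also track validation NLL here to catch overfitting
    return model
```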
Latent Variable Models
Motivation
There is a lot of variability in images due to high-level semantic factors: gender, eye color, pose, etc.
Idea: model these factors explicitly using latent variables z.
If you choose z well (i.e. z really corresponds to the factors of variation), then modeling p(x | z) can be much simpler than modeling p(x) directly.
We could identify the factors of variation using these generative models – e.g. p(eye color = blue | x)
Deep Latent Variable Models
We hope that z will capture useful factors of variation in an unsupervised manner. Training a classifier on top of z could be a lot easier.
Features are computed via the posterior p(z | x).
Mixture of Gaussians: a shallow latent variable model
A mixture of K Gaussians: z ~ Categorical(1, …, K); p(x | z = k) = N(μ_k, Σ_k).
Variational Autoencoder (VAE)
A mixture of an infinite number of Gaussians (since z is continuous).
Simple example (written out below): a Gaussian prior on z and a Gaussian likelihood whose parameters are neural networks of z.
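Spelling out the “infinite mixture” view (standard VAE generative process):
$$
z \sim \mathcal{N}(0, I), \qquad
p_\theta(x \mid z) = \mathcal{N}\big(\mu_\theta(z), \Sigma_\theta(z)\big), \qquad
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz
$$
where \(\mu_\theta\) and \(\Sigma_\theta\) are neural networks; the intractable integral (marginal likelihood) is what makes learning hard.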
Good stuff about latent variable models: complex models, natural for unsupervised learning
Hard stuff about latent variable models: learning in unsupervised manner is very difficult
Normalizing Flow Models
Autoregressive models provide tractable likelihoods but no direct mechanism for learning features.
Variational autoencoders can learn feature representations (via latent variables z), but their marginal likelihood is intractable.
Normalizing flow models have both latent variables and tractable likelihoods.
We want a latent variable model whose likelihood p(x) is tractable.
This is similar to a VAE:
- Start from a simple prior p(z), e.g. a Gaussian.
- Transform the sample via x = f_θ(z).
- Problem: the marginal p(x) = ∫ p(x, z) dz is expensive to compute.
- What if we could easily “invert” f_θ and compute z = f_θ^{-1}(x) by design? => we want f_θ to be a deterministic and invertible function (so x and z must have the same dimension).
We are going to exploit the Change of Variables formula.
Change of variables
Change of variables (1D case): If \(X = f(Z)\) with \(f\) monotone and invertible, and \(Z\) has density \(p_Z\), then
$$
p_X(x) = p_Z\!\left(f^{-1}(x)\right)\left|\frac{d f^{-1}(x)}{d x}\right|
$$
This result comes from differentiating the CDF (chain rule).
This allows us to turn a simple distribution over \(Z\) into a more complex one over \(X\).
Intuition: if the mapping is expanding in one direction (z => x), it is contracting in the other (x => z), and the density scales accordingly.
All this intuition carries over to random vectors (not just random variables.) See slides for more.
Change of variables (General case): The mapping between \(Z\) and \(X\), given by \(f: \mathbb{R}^n \to \mathbb{R}^n\), is invertible, so \(X = f(Z)\) and \(Z = f^{-1}(X)\):
$$
p_X(\mathrm{x}) = p_Z\!\left(f^{-1}(\mathrm{x})\right)\left|\det\!\left(\frac{\partial f^{-1}(\mathrm{x})}{\partial \mathrm{x}}\right)\right|
$$
Equivalently, since the Jacobian of \(f^{-1}\) is the inverse of the Jacobian of \(f\):
$$
p_X(\mathrm{x}) = p_Z(\mathrm{z})\left|\det\!\left(\frac{\partial f(\mathrm{z})}{\partial \mathrm{z}}\right)\right|^{-1}, \quad \mathrm{z} = f^{-1}(\mathrm{x})
$$
Note: \(\mathrm{x}\) and \(\mathrm{z}\) need to be continuous and have the same dimension.
It’s kinda like a VAE, but the encoder and decoder are exact deterministic inverses of each other rather than separate stochastic networks.
Note: the Jacobian determinant accounts for the change in volume under the transformation.
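A quick numerical sanity check of the 1D formula (my own toy example, using \(X = \exp(Z)\) with \(Z \sim \mathcal{N}(0,1)\), so \(X\) is standard lognormal):

```python
import numpy as np
from scipy.stats import norm, lognorm

# X = exp(Z) with Z ~ N(0, 1), so f^{-1}(x) = log(x) and |d f^{-1}/dx| = 1/x
x = 2.0
p_change_of_vars = norm.pdf(np.log(x)) * (1.0 / x)
p_reference = lognorm.pdf(x, s=1.0)        # scipy's standard lognormal (sigma = 1)
print(p_change_of_vars, p_reference)       # both ~= 0.157
```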
Flow of transformations
A flow of transformations: invertible transformations can be composed together.
By change of variables, the log-likelihood just picks up one log-determinant term per transformation:
$$
\log p_X(\mathrm{x}) = \log p_Z(\mathrm{z}) - \sum_{k} \log \left|\det\!\left(\frac{\partial f_k(\mathrm{z}_{k-1})}{\partial \mathrm{z}_{k-1}}\right)\right|, \quad \mathrm{z}_0 = \mathrm{z},\ \mathrm{z}_k = f_k(\mathrm{z}_{k-1}),\ \mathrm{x} = \mathrm{z}_K
$$
By adding more “layers” in the transformation (i.e. a deeper neural net,) we get something increasingly complexified from the prior.
Desiderata for flow models
The prior p(z) should be simple: easy to sample from and easy to evaluate (e.g. a Gaussian).
The transformations should have tractable evaluation in both directions (z => x and x => z), and tractable Jacobian determinants.
Computing the likelihoods
Key idea: Choose transformations so that their Jacobians have a “special” structure; e.g. the determinant of a triangular matrix is the product of its diagonal entries, which is O(n) instead of O(n³) for a general determinant.
^how do we get that to happen? Some possibilities:
- Make \(x_i\) depend only on \(\mathrm{z}_{\leq i}\), so the Jacobian is triangular.
- Use more efficient ways of computing determinants of Jacobians that are “close” to the identity matrix (planar flows paper.)
Nonlinear Independent Components Estimation (NICE)
Partition the variables \(\mathrm{z}\) into two disjoint subsets, \(\mathrm{z}_{1:d}\) and \(\mathrm{z}_{d+1:n}\).
Forward mapping (z=>x): \(\mathrm{x}_{1:d} = \mathrm{z}_{1:d}\) (identity) and \(\mathrm{x}_{d+1:n} = \mathrm{z}_{d+1:n} + m_\theta(\mathrm{z}_{1:d})\), where \(m_\theta\) is a neural network.
Reverse mapping (x=>z): \(\mathrm{z}_{1:d} = \mathrm{x}_{1:d}\) and \(\mathrm{z}_{d+1:n} = \mathrm{x}_{d+1:n} - m_\theta(\mathrm{x}_{1:d})\).
Jacobian: lower triangular with ones on the diagonal, so \(\det J = 1\).
Since the determinant is 1, it is a volume preserving transformation. (No expanding/contracting)
- Invertible
- Easy-to-compute Jacobian determinant
- Tractable marginal likelihood
Additive coupling layers can be composed together.
Final layer of NICE applies a rescaling transformation (so we can change the volume.)
Forward mapping: \(x_i = s_i z_i\),
where \(s_i > 0\) is a learned scaling factor for the \(i\)-th dimension.
Inverse mapping: \(z_i = x_i / s_i\).
Jacobian of forward mapping: \(J = \operatorname{diag}(s)\), so \(\det J = \prod_{i=1}^{n} s_i\).
Real-NVP: Non-volume preserving extension of NICE.
Same as NICE, but rescaling happens at each layer.
Forward mapping (z=>x): \(\mathrm{x}_{1:d} = \mathrm{z}_{1:d}\) (identity transformation) and \(\mathrm{x}_{d+1:n} = \mathrm{z}_{d+1:n} \odot \exp\!\left(\alpha_\theta(\mathrm{z}_{1:d})\right) + \mu_\theta(\mathrm{z}_{1:d})\), where \(\mu_\theta\) and \(\alpha_\theta\) are both neural networks with parameters \(\theta\), \(d\) input units, and \(n-d\) output units; \(\odot\) denotes elementwise product.
Inverse mapping (x=>z): \(\mathrm{z}_{1:d} = \mathrm{x}_{1:d}\) (identity transformation) and \(\mathrm{z}_{d+1:n} = \left(\mathrm{x}_{d+1:n} - \mu_\theta(\mathrm{x}_{1:d})\right) \odot \exp\!\left(-\alpha_\theta(\mathrm{x}_{1:d})\right)\).
Jacobian of forward mapping: lower triangular, with \(\det J = \exp\!\left(\sum_{i} \alpha_\theta(\mathrm{z}_{1:d})_i\right)\) (sum over the \(n-d\) outputs).
Non-volume preserving transformation in general since determinant can be less than or greater than 1
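A sketch of a single Real-NVP-style affine coupling layer (my own minimal version; the conditioner network, sizes, and names are arbitrary choices, not the paper’s architecture):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """x_{1:d} = z_{1:d};  x_{d+1:n} = z_{d+1:n} * exp(alpha(z_{1:d})) + mu(z_{1:d})."""
    def __init__(self, n, d, hidden=64):
        super().__init__()
        self.d = d
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (n - d)))  # outputs [alpha, mu]

    def forward(self, z):                        # z -> x, plus log|det J|
        z1, z2 = z[:, :self.d], z[:, self.d:]
        alpha, mu = self.net(z1).chunk(2, dim=1)
        x2 = z2 * torch.exp(alpha) + mu
        log_det = alpha.sum(dim=1)               # det J = prod_i exp(alpha_i)
        return torch.cat([z1, x2], dim=1), log_det

    def inverse(self, x):                        # x -> z
        x1, x2 = x[:, :self.d], x[:, self.d:]
        alpha, mu = self.net(x1).chunk(2, dim=1)
        z2 = (x2 - mu) * torch.exp(-alpha)
        return torch.cat([x1, z2], dim=1)

layer = AffineCoupling(n=6, d=3)
z = torch.randn(4, 6)
x, log_det = layer(z)
print(torch.allclose(layer.inverse(x), z, atol=1e-5))   # True
```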
Autoregressive Models as Normalizing Flow Models
We can view autoregressive models as flow models.
Consider a Gaussian autoregressive model:
$$
p(\mathrm{x})=\prod_{i=1}^{n} p\left(x_{i} \mid \mathrm{x}_{<i}\right)
$$
such that \(p\left(x_{i} \mid \mathrm{x}_{<i}\right)=\mathcal{N}\left(\mu_{i}\left(x_{1}, \ldots, x_{i-1}\right), \exp \left(\alpha_{i}\left(x_{1}, \ldots, x_{i-1}\right)\right)^{2}\right)\), where \(\mu_i\) and \(\alpha_i\) are neural networks for \(i>1\) and constants for \(i=1\).
Sampler for this model:
- Sample \(z_{i} \sim \mathcal{N}(0,1)\) for \(i=1, \ldots, n\)
- Let \(x_{1}=\exp \left(\alpha_{1}\right) z_{1}+\mu_{1}\). Compute \(\mu_{2}\left(x_{1}\right), \alpha_{2}\left(x_{1}\right)\)
- Let \(x_{2}=\exp \left(\alpha_{2}\right) z_{2}+\mu_{2}\). Compute \(\mu_{3}\left(x_{1}, x_{2}\right), \alpha_{3}\left(x_{1}, x_{2}\right)\)
- Let \(x_{3}=\exp \left(\alpha_{3}\right) z_{3}+\mu_{3}\), and so on
Flow interpretation: the model transforms samples from the standard Gaussian \(\left(z_{1}, \ldots, z_{n}\right)\) into the data \(\left(x_{1}, \ldots, x_{n}\right)\) via invertible scale-and-shift transformations.
Masked Autoregressive Flow (MAF)
Forward mapping from \(\mathrm{z} \mapsto \mathrm{x}\):
- Let \(x_{1}=\exp \left(\alpha_{1}\right) z_{1}+\mu_{1}\). Compute \(\mu_{2}\left(x_{1}\right), \alpha_{2}\left(x_{1}\right)\)
- Let \(x_{2}=\exp \left(\alpha_{2}\right) z_{2}+\mu_{2}\). Compute \(\mu_{3}\left(x_{1}, x_{2}\right), \alpha_{3}\left(x_{1}, x_{2}\right)\), and so on
Sampling is sequential and slow (like autoregressive): \(O(n)\) time.
Inverse mapping from \(\mathrm{x} \mapsto \mathrm{z}\):
- Compute all \(\mu_{i}, \alpha_{i}\) (can be done in parallel using e.g., MADE)
- Let \(z_{1}=\left(x_{1}-\mu_{1}\right) / \exp \left(\alpha_{1}\right)\) (scale and shift)
- Let \(z_{2}=\left(x_{2}-\mu_{2}\right) / \exp \left(\alpha_{2}\right)\)
- Let \(z_{3}=\left(x_{3}-\mu_{3}\right) / \exp \left(\alpha_{3}\right)\), and so on
Jacobian is lower triangular, hence efficient determinant computation. Likelihood evaluation is easy and parallelizable (like MADE)
Layers with different variable orderings can be stacked
Inverse Autoregressive Flow (IAF)
Identical to MAF, but change the roles of z and x.
Computational tradeoffs of MAF vs. IAF
MAF: Fast likelihood evaluation, slow sampling
^good for training
IAF: Fast sampling, slow likelihood evaluation
^good for inference
Parallel Wavenet
Idea: best of both worlds…teacher MAF model, student IAF model. First train MAF model normally. Then train IAF model to minimize divergence with MAF model. Use IAF model at test-time.
Probability density distillation: the student distribution is trained to minimize the KL divergence between the student (s) and the teacher (t), \(D_{KL}(s \,\|\, t)\).
Evaluating and optimizing Monte Carlo estimates of this objective requires:
- Samples \(x\) from the student model (IAF)
- Density of \(x\) assigned by the student model (IAF)
- Density of \(x\) assigned by the teacher model (MAF)
All operations above can be implemented efficiently.
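The distillation objective written out (standard KL form, with \(s\) = student IAF, \(t\) = teacher MAF):
$$
D_{KL}(s \,\|\, t) = \mathbb{E}_{x \sim s}\big[\log s(x) - \log t(x)\big]
$$
Each term matches a bullet above: samples \(x \sim s\) are cheap (IAF samples fast), \(\log s(x)\) is cheap for the student’s own samples (the noise used to generate them is already known, so no slow inversion is needed), and \(\log t(x)\) is cheap because MAF evaluates likelihoods fast.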
Invertible CNNs
It’s possible to change a convolutional architecture to become invertible.
We can use masked convolutions to enforce ordering => Jacobian is lower triangular + easy to compute. If all the diagonal elements of Jacobian are positive, the transformation is invertible.
The point is, you can train a (suitably constrained) ResNet-style network, then invert it and use it as a flow model.
MintNet
uses masked/causal convolutions in a way that enforces an ordering, makes the Jacobian triangular, and makes the transformation invertible.
Gaussianization flows
Let \(\mathrm{z}=f_{\theta}(\mathrm{x})\) be the transformed data.
Flow models are trained with maximum likelihood to minimize the KL divergence \(D_{KL}\!\left(p_{\text{data}}(\mathrm{x}) \,\|\, p_{\theta}(\mathrm{x})\right)\).
It can be shown that this equals the KL divergence between the distribution of the transformed data \(f_{\theta}(\mathrm{x})\) and the Gaussian prior – i.e. training the flow tries to make the transformed data as close to a standard Gaussian as possible (“Gaussianization”).
Inverse CDF trick
Inverse CDF gives you samples from a distribution: if \(U \sim \text{Uniform}(0,1)\), then \(F^{-1}(U)\) has CDF \(F\). E.g. the inverse Gaussian CDF composed with the data CDF, \(\Phi^{-1}\!\left(F_{\text{data}}(x)\right)\), maps each dimension of the data to (approximately) a standard Gaussian.
Step 1: Dimension-wise Gaussianization
Step 2: apply a rotation matrix to the transformed data
Repeat Step 1 and Step 2 (“stack” these layers) => the transformed data eventually becomes jointly Gaussian.
Generative Adversarial Networks (GANs)
Autoregressive models and VAEs use maximum likelihood training on the marginal likelihood (or an approximation of it, at least.) But why maximum likelihood? => higher likelihood = better lossless compression.
But…let’s say our goal isn’t compression, but high-quality samples. Granted…the optimal generative model will maximize both sample quality and log-likelihood. However, in real life, nothing is perfect, and for imperfect models, high likelihood != good sample quality. (Can have great likelihoods, but terrible samples, or terrible likelihoods but great samples.)
Likelihood-free learning considers objectives that do not depend directly on a likelihood function.
When we don’t have access to likelihood, we can’t depend on KL divergence to optimize. Need a new way of quantifying distance.
Given finite sets of samples from two distributions, \(S_1=\{\mathrm{x} \sim P\}\) and \(S_2=\{\mathrm{x} \sim Q\}\), a two-sample test decides whether to accept or reject the hypothesis \(P = Q\), using only the samples.
New objective: train the generative model to minimize a two-sample test objective between \(S_1\) (the training data, from \(p_{\text{data}}\)) and \(S_2\) (samples from the model).
But…in the generative modeling setting, we know that \(S_1\) and \(S_2\) come from different distributions, and finding a good two-sample test statistic in high dimensions is hard. Key idea: learn the statistic itself – train a model (a discriminator) to maximize the distinction between the two sets of samples.
Two-sample test via a Discriminator
A neural net that tries to distinguish “real” from “fake” samples.
Maximize the two-sample test objective (in support of the hypothesis that the two distributions are different, i.e. \(p_{\text{data}} \neq p_{\text{model}}\)).
Training objective for discriminator:
For a fixed generator \(G\), the discriminator \(D\) should (objective written out below):
- Assign probability 1 to true data points
- Assign probability 0 to fake samples
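Concretely, the (standard) objective the discriminator maximizes:
$$
\max_{D} V(G, D) = \mathbb{E}_{\mathrm{x} \sim p_{\text{data}}}[\log D(\mathrm{x})] + \mathbb{E}_{\mathrm{x} \sim p_{G}}[\log (1-D(\mathrm{x}))]
$$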
Optimal discriminator: \(D^{*}_{G}(\mathrm{x})=\frac{p_{\text{data}}(\mathrm{x})}{p_{\text{data}}(\mathrm{x})+p_{G}(\mathrm{x})}\)
(We can’t compute this directly, though – it involves the likelihoods we’re trying to avoid.)
GANs are basically a two-player minimax game between a generator and discriminator.
Generator
Directed latent variable model with a deterministic mapping between \(\mathrm{z}\) and \(\mathrm{x}\): \(\mathrm{x}=G_{\theta}(\mathrm{z})\), with \(\mathrm{z}\) sampled from a simple prior.
Training objective for generator: \(\min_{G} \mathbb{E}_{\mathrm{x} \sim p_{\text{data}}}[\log D(\mathrm{x})]+\mathbb{E}_{\mathrm{x} \sim p_{G}}[\log (1-D(\mathrm{x}))]\)
For the optimal discriminator \(D^{*}_{G}(\cdot)\), this becomes
$$
2\, D_{JSD}\!\left(p_{\text{data}} \,\|\, p_{G}\right)-\log 4
$$
Btw, there are other divergences we can use than the Jensen-Shannon divergence.
GAN training algorithm
Sample a minibatch of \(m\) training points from \(\mathcal{D}\) and a minibatch of \(m\) noise vectors from the prior \(p_{\mathrm{z}}\); update the discriminator parameters by stochastic gradient ascent on the objective.
Update the generator parameters by stochastic gradient descent on the (same) objective.
Repeat for fixed number of epochs…or until samples look good, lol.
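A bare-bones sketch of that alternating update (toy code; `gen`, `disc`, the optimizers, and `z_dim` are placeholders, the discriminator is assumed to output sigmoid probabilities of shape (m, 1), and the generator uses the common non-saturating loss rather than the raw minimax objective):

```python
import torch
import torch.nn.functional as F

def gan_step(gen, disc, x_real, opt_g, opt_d, z_dim):
    m = x_real.shape[0]

    # 1) discriminator ascent: push D(real) -> 1 and D(fake) -> 0
    z = torch.randn(m, z_dim)
    x_fake = gen(z).detach()                      # don't backprop into G here
    d_loss = F.binary_cross_entropy(disc(x_real), torch.ones(m, 1)) + \
             F.binary_cross_entropy(disc(x_fake), torch.zeros(m, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) generator update (non-saturating variant): push D(G(z)) -> 1
    z = torch.randn(m, z_dim)
    g_loss = F.binary_cross_entropy(disc(gen(z)), torch.ones(m, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```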
GANs can be very challenging to train in practice.