DDPM
In physics and chemistry, the principle of microscopic reversibility states that
“the microscopic detailed dynamics of particles and fields is time-reversible because the microscopic equations of motion are symmetric with respect to inversion in time.”
This means that the diffusion of particles can be reversed at the microscopic level.
Assuming this principle also holds for images, the process that diffuses images into noise and the process that denoises them are symmetric, so we can train a neural network to learn the reverse denoising process.
This is, broadly, what Denoising Diffusion Probabilistic Models (DDPMs) do: they generate data by gradually denoising a sample that starts as Gaussian noise.
Since this principle holds for “microscopic detailed dynamics”, the Forward Diffusion process is designed so that it gradually diffuses data to Gaussian noise.
In each step, we sample from a Gaussian distribution that perturbs the data. Formally, we define it as a Markov chain of Gaussians:
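In the notation of Ho et al. (2020), which this page appears to follow, the chain can be written as below, where the variance schedule \(\beta_1, \dots, \beta_T\) is fixed in advance:

\[
q(\bx_{1:T} | \bx_0) = \prod_{t=1}^{T} q(\bx_t | \bx_{t-1}), \qquad
q(\bx_t | \bx_{t-1}) = \mathcal{N}(\bx_t; \sqrt{1-\beta_t}\,\bx_{t-1}, \beta_t \bI)
\]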
Diffusion models scale the data down at each forward step (by a factor of \(\sqrt{1-\beta_t}\)) so that the variance does not grow as noise is added, which gives the neural network in the reverse process consistently scaled inputs.
Note that we can sample \(\bx_t\) for an arbitrary timestep \(t\) in closed form:
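Writing \(\alpha_t = 1 - \beta_t\) and \(\bar\alpha_t = \prod_{s=1}^{t} \alpha_s\) (the usual DDPM definitions, assumed here), the closed form is:

\[
q(\bx_t | \bx_0) = \mathcal{N}(\bx_t; \sqrt{\bar\alpha_t}\,\bx_0, (1-\bar\alpha_t)\bI)
\]

Equivalently, \(\bx_t = \sqrt{\bar\alpha_t}\,\bx_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\) with \(\epsilon \sim \mathcal{N}(\bzero, \bI)\).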
\(\beta_t\) is chosen to be small relative to data scaled to \([-1, 1]\), which ensures that each step is microscopic, and \(T\) is chosen large enough that the data is completely diffused to Gaussian noise.
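As a rough check of these two conditions, the sketch below builds a linear schedule from \(10^{-4}\) to \(0.02\) with \(T = 1000\) and looks at \(\bar\alpha_t = \prod_{s \leq t}(1 - \beta_s)\) at both ends. It uses plain Python rather than torch so that it runs without dependencies; the helper names are hypothetical and not part of `dmme`.

```python
def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Return beta_1..beta_T spaced linearly (hypothetical helper, not dmme API)."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alpha_bar(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for every t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha_bar(betas)
# alpha_bar_1 stays close to 1 (a tiny first step),
# while alpha_bar_T is close to 0 (almost no signal left).
print(alpha_bars[0], alpha_bars[-1])
```

Since \(\bx_T\) has mean \(\sqrt{\bar\alpha_T}\,\bx_0\), a value of \(\bar\alpha_T\) near zero is what "completely diffused" means in practice.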
Since the forward and reverse processes are symmetric, the reverse denoising process should also be a Markov chain of Gaussians, starting from \(p(\bx_T)=\mathcal{N}(\bx_T; \bzero, \bI)\):
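Following the same notation as Ho et al. (2020), the reverse chain factorizes as:

\[
p_\theta(\bx_{0:T}) = p(\bx_T) \prod_{t=1}^{T} p_\theta(\bx_{t-1} | \bx_t), \qquad
p_\theta(\bx_{t-1} | \bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))
\]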
To generate data, we sample \(x_T\) from the standard normal distribution and then iteratively sample from \(p_\theta(x_{t-1}|x_t)\). The final denoising step uses a discrete decoder, and the noise in that step is set to zero.
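The sampling procedure can be sketched in plain Python on a flat list of floats. This is an illustration under stated assumptions, not the `dmme` implementation: `eps_model` is a hypothetical callable that returns predicted noise, the reverse variance is fixed to \(\sigma_t^2 = \beta_t\), and the mean uses the noise-prediction form \(\frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t, t)\big)\) from Ho et al. (2020).

```python
import math
import random


def sample_loop(eps_model, betas, dim, rng):
    """Ancestral sampling sketch; eps_model(x, t) is a hypothetical noise predictor."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Start from x_T ~ N(0, I).
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(betas), 0, -1):  # t = T, ..., 1
        beta, alpha, alpha_bar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        eps = eps_model(x, t)
        coef = beta / math.sqrt(1.0 - alpha_bar)
        mean = [(xi - coef * ei) / math.sqrt(alpha) for xi, ei in zip(x, eps)]
        # Assumed sigma_t^2 = beta_t; no noise is added in the final step (t == 1).
        sigma = math.sqrt(beta) if t > 1 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


# Toy usage: a stand-in "model" that always predicts zero noise.
betas = [1e-4 + i * (0.02 - 1e-4) / 9 for i in range(10)]
sample = sample_loop(lambda x, t: [0.0] * len(x), betas, dim=4, rng=random.Random(0))
```

With a trained network in place of the zero predictor, the same loop would turn noise into a sample; here it only demonstrates the control flow.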
For training, we optimize the variational lower bound objective from variational autoencoders.
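In the standard DDPM formulation (assumed here to be the one this page follows), this is a bound on the negative log-likelihood:

\[
\mathbb{E}\left[-\log p_\theta(\bx_0)\right] \leq \mathbb{E}_q\left[-\log \frac{p_\theta(\bx_{0:T})}{q(\bx_{1:T} | \bx_0)}\right] = L
\]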
We can rewrite the variational lower bound as:
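A form consistent with the terms \(L_T\), \(L_{t-1}\) and \(L_0\) used on this page is the KL decomposition from Ho et al. (2020):

\[
L = \mathbb{E}_q\Big[
\underbrace{D_\text{KL}\big(q(\bx_T | \bx_0) \,\|\, p(\bx_T)\big)}_{L_T}
+ \sum_{t \gt 1} \underbrace{D_\text{KL}\big(q(\bx_{t-1} | \bx_t, \bx_0) \,\|\, p_\theta(\bx_{t-1} | \bx_t)\big)}_{L_{t-1}}
\underbrace{- \log p_\theta(\bx_0 | \bx_1)}_{L_0}
\Big]
\]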
The loss can then be written as \(L = L_T + \sum_{t \gt 1} L_{t-1} + L_0\).
In \(L_{t-1}\), we parameterize the neural network so that \(p_\theta(\bx_{t-1}|\bx_t)\) closely matches the forward process posterior \(q(\bx_{t-1}|\bx_t, \bx_0)\).
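The target that the network has to match is tractable once we condition on \(\bx_0\). With \(\alpha_t = 1 - \beta_t\) and \(\bar\alpha_t = \prod_{s=1}^{t}\alpha_s\) (the usual DDPM definitions, assumed here), the forward process posterior is:

\[
q(\bx_{t-1} | \bx_t, \bx_0) = \mathcal{N}(\bx_{t-1}; \tilde\bmu_t(\bx_t, \bx_0), \tilde\beta_t \bI), \qquad
\tilde\bmu_t(\bx_t, \bx_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\bx_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\bx_t
\]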
Recall that \(p_\theta(\bx_{t-1}|\bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))\) for \({1 \lt t \leq T}\).
With \(p_\theta(\bx_{t-1} | \bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \sigma_t^2\bI)\), we can write:
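Assuming the derivation in Ho et al. (2020), where \(\tilde\bmu_t(\bx_t, \bx_0)\) denotes the mean of the forward process posterior \(q(\bx_{t-1}|\bx_t, \bx_0)\) and \(C\) is a constant that does not depend on \(\theta\), the term becomes:

\[
L_{t-1} = \mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \left\| \tilde\bmu_t(\bx_t, \bx_0) - \bmu_\theta(\bx_t, t) \right\|^2 \right] + C
\]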
Experimentally, both \(\sigma_t^2 = \beta_t\) and \(\sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t\) had similar results.
Input image data is assumed to be integers in \(\{0, 1, \dots, 255\}\) scaled linearly to \([-1, 1]\). The last step of the reverse process is set to an independent discrete decoder. At the final step of sampling, noise is not used.
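The linear scaling is the map \(x \mapsto x / 127.5 - 1\), which sends 0 to -1 and 255 to 1. A minimal sketch with hypothetical helper names (not part of `dmme`), including the inverse used to turn a generated sample back into integer pixels:

```python
def scale_pixels(pixels):
    """Map integer pixel values in {0, ..., 255} linearly to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def unscale_pixels(values):
    """Map values in [-1, 1] back to integers in {0, ..., 255}, clamping out-of-range input."""
    return [min(255, max(0, round((v + 1.0) * 127.5))) for v in values]


scaled = scale_pixels([0, 128, 255])
restored = unscale_pixels(scaled)
```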
Then we can simplify the loss to
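Assuming the noise-prediction parameterization of Ho et al. (2020), in which the network \(\epsilon_\theta\) predicts the noise that was added to \(\bx_0\), the term takes the form:

\[
L_{t-1} - C = \mathbb{E}_{\bx_0, \epsilon}\left[ \lambda_t \left\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,\bx_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\, t\big) \right\|^2 \right], \qquad
\lambda_t = \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}
\]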
For small \(t\), the weight \(\lambda_t\) is large, so the loss emphasizes steps that contain very little noise. In the paper, setting \(\lambda_t = 1\) improves sample quality.
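To see how the weight behaves, the sketch below evaluates \(\lambda_t = \beta_t^2 / (2\sigma_t^2\alpha_t(1-\bar\alpha_t))\) for a linear schedule, assuming \(\sigma_t^2 = \beta_t\) and \(T = 1000\). The function name is hypothetical and the code is plain Python, not the `dmme` implementation.

```python
def loss_weights(betas):
    """lambda_t = beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)), with sigma_t^2 = beta_t."""
    weights, alpha_bar = [], 1.0
    for beta in betas:
        alpha = 1.0 - beta
        alpha_bar *= alpha
        sigma_sq = beta  # assumed variance choice; beta-tilde is the other option
        weights.append(beta ** 2 / (2.0 * sigma_sq * alpha * (1.0 - alpha_bar)))
    return weights


T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
lam = loss_weights(betas)
print(lam[0], lam[-1])  # weight at t = 1 versus t = T
```

Under these assumptions the weight at \(t = 1\) is roughly fifty times the weight at \(t = T\). Replacing every \(\lambda_t\) with 1 removes that emphasis, which is what \(L_\text{simple}\) does.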
- Forward Process, \(q(x_t|x_{t-1})\)
- Reverse Denoising Process, \(p_\theta(x_{t-1}|x_t)\)
- Samples from a Gaussian distribution using the reparameterization trick
- Constants increasing linearly from \(10^{-4}\) to \(0.02\)
- Simple loss objective \(L_\text{simple}\), MSE loss between noise and predicted noise
- Training and Sampling for DDPM
- U-Net for predicting noise in images
- LightningModule for training DDPM
DDPM Training and Sampling
- class dmme.ddpm.DDPM(model, timesteps)
Training and Sampling for DDPM
- Parameters:
model (nn.Module) – model for estimating noise
timesteps (int) – total timesteps \(T\)
- training_step(x_0)
Computes loss for DDPM
- Parameters:
x_0 (torch.Tensor) – sample image to add noise and denoise for training
- Returns:
loss, \(L_\text{simple}\)
- Return type:
(torch.Tensor)
- sampling_step(x_t, t)
Denoise image by sampling from \(p_\theta(x_{t-1}|x_t)\)
- Parameters:
model (nn.Module) – model for estimating noise
x_t (torch.Tensor) – image of shape \((N, C, H, W)\)
t (torch.Tensor) – starting \(t\) to sample from, a tensor of shape \((N,)\)
- Returns:
denoised image of shape \((N, C, H, W)\)
- Return type:
(torch.Tensor)
- generate(img_size: Tuple[int, int, int, int])
Generate image of shape \((N, C, H, W)\) by running the full denoising steps
- Parameters:
img_size (Tuple[int, int, int, int]) – image size to generate as a tuple \((N, C, H, W)\)
- Returns:
generated image of shape \((N, C, H, W)\)
- Return type:
(torch.Tensor)
- dmme.ddpm.forward_process(image, alpha_bar_t, noise)
Forward Process, \(q(x_t|x_{t-1})\)
- Parameters:
image (torch.Tensor) – image of shape \((N, C, H, W)\)
alpha_bar_t (torch.Tensor) – \(\bar\alpha_t\) of shape \((N, 1, 1, *)\)
noise (torch.Tensor) – noise sampled from standard normal distribution with the same shape as the image
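Given these arguments, the function presumably evaluates the closed-form sample \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\); that is an assumption based on the signature, since the source is not shown here. A plain-Python sketch of that formula on a flat list of floats, with a hypothetical function name:

```python
import math
import random


def forward_sample(x_0, alpha_bar_t, noise):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, elementwise."""
    scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [scale * x + noise_scale * n for x, n in zip(x_0, noise)]


rng = random.Random(0)
x_0 = [-1.0, 0.0, 0.5, 1.0]                    # pixel values already scaled to [-1, 1]
noise = [rng.gauss(0.0, 1.0) for _ in x_0]
x_early = forward_sample(x_0, 0.9999, noise)   # early timestep: close to x_0
x_late = forward_sample(x_0, 0.0, noise)       # alpha_bar = 0: pure noise
```

In the torch version, the \((N, 1, 1, *)\) shape of `alpha_bar_t` would let one value per batch element broadcast over the channel and spatial dimensions.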
- dmme.ddpm.reverse_process(x_t, beta_t, alpha_t, alpha_bar_t, noise_in_x_t, variance, noise)
Reverse Denoising Process, \(p_\theta(x_{t-1}|x_t)\)
- Parameters:
beta_t (torch.Tensor) – \(\beta_t\) of shape \((N, 1, 1, *)\)
alpha_t (torch.Tensor) – \(\alpha_t\) of shape \((N, 1, 1, *)\)
alpha_bar_t (torch.Tensor) – \(\bar\alpha_t\) of shape \((N, 1, 1, *)\)
noise_in_x_t (torch.Tensor) – estimated noise in \(x_t\) predicted by a neural network
variance (torch.Tensor) – variance of the reverse process, either learned or fixed
noise (torch.Tensor) – noise sampled from \(\mathcal{N}(0, I)\)
- dmme.ddpm.sample_gaussian(mean, variance, noise)
Samples from a gaussian distribution using the reparameterization trick
- Parameters:
mean (torch.Tensor) – mean of the distribution
variance (torch.Tensor) – variance of the distribution
noise (torch.Tensor) – noise sampled from \(\mathcal{N}(0, I)\)
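The reparameterization trick expresses a draw from \(\mathcal{N}(\mu, \sigma^2 I)\) as \(\mu + \sigma \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\), so the randomness is passed in as an argument. A plain-Python sketch with a hypothetical function name, not the `dmme` source:

```python
import math
import random


def sample_gaussian_sketch(mean, variance, noise):
    """Reparameterized draw: mean + sqrt(variance) * noise, elementwise."""
    return [m + math.sqrt(v) * z for m, v, z in zip(mean, variance, noise)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
draws = sample_gaussian_sketch([2.0] * len(noise), [0.25] * len(noise), noise)
sample_mean = sum(draws) / len(draws)  # should land near the requested mean of 2.0
```

Passing the noise explicitly also makes the function deterministic, which is convenient for testing and for setting the noise to zero in the final sampling step.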
U-Net for estimating noise in images
- U-Net for predicting noise in images
- Transformer position encoding
- 3x3 basic resblocks with group norm, dropout and timestep embeddings
- Downsample blocks
- Upsample blocks
- Self Attention with groupnorm
- class dmme.ddpm.UNet(in_channels, pos_dim=128, emb_dim=512, num_groups=32, dropout=0.1, channels_per_depth=(128, 256, 256, 256), num_blocks=2, attention_depths=(2,))
U-Net for predicting noise in images
- Parameters:
in_channels (int) – input channels of image
pos_dim (int) – dimension of position embedding
emb_dim (int) – dimension of timestep embedding
num_groups (int) – number of groups in nn.GroupNorm
dropout (float) – dropout rate in nn.Dropout2d
channels_per_depth (Tuple[int, ...]) – channels per depth
num_blocks (int) – number of resblocks to use in each depth
attention_depths (Tuple[int, ...]) – depths to use attention blocks
- class dmme.ddpm.SinusoidalPositionEmbeddings(dim)
Transformer position encoding
- Parameters:
dim (int) – number of dimensions of the position embedding, \(d_\text{emb}\)
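For reference, a Transformer-style sinusoidal encoding of a timestep can be sketched as follows. The layout (all sines followed by all cosines), the base period of 10000 and the even `dim` are assumptions; the class above may order or scale the components differently.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Return [sin(t * w_i)] + [cos(t * w_i)] for geometrically spaced frequencies w_i."""
    half = dim // 2  # assumes dim is even
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]


emb = sinusoidal_embedding(t=10, dim=8)
```

Each timestep gets a distinct vector of bounded values, which the network can then project to the timestep embedding dimension.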
- class dmme.ddpm.ResBlock(c_in, c_out, with_attention=False, emb_dim=512, num_groups=32, p=0.1)
3x3 basic resblocks with group norm, dropout and timestep embeddings
- Parameters:
c_in (int) – number of input channels
c_out (int) – number of output channels
with_attention (bool) – whether to add attention block
emb_dim (int) – input timestep embedding dimension
num_groups (int) – number of groups in nn.GroupNorm
p (float) – dropout rate in nn.Dropout2d
- dmme.ddpm.DownSample(c_in, c_out)
Downsample blocks
- Parameters:
c_in (int) – number of input channels
c_out (int) – number of output channels
- Returns:
down sampling layer using 2d convolutions
- Return type:
(nn.Conv2d)
- class dmme.ddpm.UpSample(c_in, c_out)
Upsample blocks
- Parameters:
c_in (int) – number of input channels
c_out (int) – number of output channels
Training Loop
- class dmme.ddpm.LitDDPM(model: Module, lr: float = 0.0002, warmup: int = 5000, imgsize: Tuple[int, int, int] = (3, 32, 32), timesteps: int = 1000, decay: float = 0.9999)
LightningModule for training DDPM
- Parameters:
model (nn.Module) – neural network predicting noise \(\epsilon_\theta\)
lr (float) – learning rate, defaults to \(2 \times 10^{-4}\)
warmup (int) – linearly increases learning rate for warmup steps until lr is reached, defaults to 5000
imgsize (Tuple[int, int, int]) – image size in (C, H, W)
timesteps (int) – total timesteps for the forward and reverse process, \(T\)
decay (float) – EMA decay value
- forward(x_t: Tensor, t: int)
Denoise image once using DDPM
- Parameters:
x_t (torch.Tensor) – image of shape \((N, C, H, W)\)
t (int) – starting \(t\) to sample from
noise (torch.Tensor) – noise to use for sampling, if None samples new noise
- Returns:
generated sample of shape \((N, C, H, W)\)
- Return type:
(torch.Tensor)
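The `warmup` and `decay` parameters of `LitDDPM` describe a linear learning-rate warmup and an exponential moving average (EMA) of the weights. How the module implements them is not shown on this page, so the following is only a plain-Python sketch of the two rules with hypothetical function names:

```python
def warmup_lr(step, base_lr=2e-4, warmup=5000):
    """Linear warmup from 0 to base_lr over `warmup` steps, then constant."""
    return base_lr * min(1.0, step / warmup)


def ema_update(ema_params, params, decay=0.9999):
    """ema <- decay * ema + (1 - decay) * param, elementwise on flat lists."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


lr_halfway = warmup_lr(2500)                     # half of base_lr
ema = ema_update([1.0, 1.0], [0.0, 2.0], 0.5)    # midpoint of old and new values
```

With `decay` close to 1, the averaged weights change slowly, which is why EMA copies are commonly used for sampling rather than the raw training weights.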