DDPM#

In physics and chemistry, the principle of microscopic reversibility states that

“the microscopic detailed dynamics of particles and fields is time-reversible because the microscopic equations of motion are symmetric with respect to inversion in time”

This means that the diffusion of particles can be reversed at the microscopic level.

Assuming this principle also holds for images, we can train a neural network to learn the reverse (denoising) process from the forward diffusion of images into noise, since the two processes are symmetric.

This is, broadly, what Denoising Diffusion Probabilistic Models (DDPM) do: they generate data by gradually denoising a sample that starts as Gaussian noise.

Since this principle holds for “microscopic detailed dynamics”, the forward diffusion process is designed to diffuse data into Gaussian noise gradually, in many small steps.

In each step, we sample from a Gaussian distribution that perturbs the data. Formally, we define it as a Markov chain of Gaussians:

\[ \begin{aligned} q(\bx_{1:T} | \bx_0) &\defeq \prod_{t=1}^T q(\bx_t | \bx_{t-1} ), \qquad q(\bx_t|\bx_{t-1}) \defeq \mathcal{N}(\bx_t;\sqrt{1-\beta_t}\bx_{t-1},\beta_t \bI) \end{aligned} \]

Diffusion models scale down the data at each forward process step (by a factor of \(\sqrt{1-\beta_t}\)) so that the variance does not grow when noise is added, thus providing consistently scaled inputs to the neural network that models the reverse process.
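The effect of the scaling factor can be checked with a few lines of arithmetic. The sketch below tracks the per-coordinate variance through the forward chain; the constant schedule and the function name `variance_after_steps` are assumptions made for illustration only.

```python
def variance_after_steps(initial_variance, betas):
    """Track Var(x_t) under x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    variance = initial_variance
    history = [variance]
    for beta in betas:
        # Scaling by sqrt(1 - beta) multiplies the variance by (1 - beta);
        # the independent Gaussian noise then adds beta.
        variance = (1.0 - beta) * variance + beta
        history.append(variance)
    return history


betas = [0.02] * 500  # hypothetical constant schedule, for illustration only
unit = variance_after_steps(1.0, betas)    # data that already has unit variance
small = variance_after_steps(0.25, betas)  # data with a smaller variance
```

Data with unit variance keeps unit variance at every step, while data with a different variance is pulled toward 1 instead of growing without bound.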

Note that we can sample \(\bx_t\) for an arbitrary timestep \(t\) in closed form:

\[ \begin{aligned} \alpha_t &\defeq 1-\beta_t, \quad \bar\alpha_t \defeq \prod_{s=1}^t \alpha_s \\ q(\bx_t|\bx_0) &= \mathcal{N}(\bx_t; \sqrt{\bar\alpha_t}\bx_0, (1-\bar\alpha_t)\bI) \end{aligned} \]
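As a minimal sketch of the closed form above, the following computes \(\bar\alpha_t\) from a schedule and draws \(\bx_t\) directly from \(\bx_0\). It works on a flat list of pixel values with the standard library; the toy schedule and the helper names are assumptions, not part of the library.

```python
import math
import random


def cumulative_alphas(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T] with alpha_t = 1 - beta_t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def sample_x_t(x_0, alpha_bar_t, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    noise = [rng.gauss(0.0, 1.0) for _ in x_0]
    x_t = [signal * x + sigma * e for x, e in zip(x_0, noise)]
    return x_t, noise


betas = [0.1, 0.2, 0.3]                # toy schedule, far larger than practical values
alpha_bars = cumulative_alphas(betas)  # approximately [0.9, 0.72, 0.504]
x_0 = [-1.0, 0.0, 1.0]
x_3, eps = sample_x_t(x_0, alpha_bars[2], random.Random(0))
```

Returning the noise alongside \(\bx_t\) is a design choice made here because the training objective later needs the exact noise that was added.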

\(\beta_t\) is chosen to be small relative to data scaled to \([-1, 1]\), which ensures that each step is microscopic, and \(T\) is chosen large enough that the data is almost completely diffused into Gaussian noise.

Since the forward and reverse processes are symmetric, the reverse denoising process should also be a Markov chain of Gaussians, starting from \(p(\bx_T)=\mathcal{N}(\bx_T; \bzero, \bI)\):

\[ \begin{aligned} p_\theta(\bx_{0:T}) &\defeq p(\bx_T)\prod_{t=1}^T p_\theta(\bx_{t-1}|\bx_t), \qquad p_\theta(\bx_{t-1}|\bx_t) \defeq \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t)) \end{aligned} \]

To generate data, we sample \(\bx_T\) from the standard normal distribution and then iteratively sample from \(p_\theta(\bx_{t-1}|\bx_t)\). In the final denoising step, which acts as a discrete decoder, the noise is set to zero.
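The generation procedure can be sketched as a plain loop. This is an illustration under stated assumptions rather than the library's implementation: it fixes \(\sigma_t^2 = \beta_t\), uses the noise-prediction form of the mean, operates on a flat list of values, and `zero_noise_model` is a hypothetical stand-in for a trained network.

```python
import math
import random


def sample(predict_noise, betas, size, rng):
    """Ancestral sampling with sigma_t^2 = beta_t on a flat list of `size` values."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):              # t = T, ..., 1
        beta, alpha, alpha_bar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        eps_hat = predict_noise(x, t)
        coef = beta / math.sqrt(1.0 - alpha_bar)
        mean = [(xi - coef * ei) / math.sqrt(alpha) for xi, ei in zip(x, eps_hat)]
        if t > 1:
            sigma = math.sqrt(beta)
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # final step: return the mean without adding noise
    return x


def zero_noise_model(x, t):
    """Hypothetical stand-in for the trained network eps_theta."""
    return [0.0 for _ in x]


betas = [1e-4 + (0.02 - 1e-4) * i / 99 for i in range(100)]
image = sample(zero_noise_model, betas, size=4, rng=random.Random(0))
```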

For training, we optimize the variational bound on the negative log-likelihood, the same kind of objective used for variational autoencoders:

\[ \begin{aligned} \Ea{-\log p_\theta(\bx_0)} &\leq \Eb{q}{ - \log \frac{p_\theta(\bx_{0:T})}{q(\bx_{1:T} | \bx_0)}} \\ &= \mathbb{E}_q\bigg[ -\log p(\bx_T) - \sum_{t \geq 1} \log \frac{p_\theta(\bx_{t-1} | \bx_t)}{q(\bx_t|\bx_{t-1})} \bigg] \eqqcolon L \end{aligned} \]

We can rewrite the variational bound as a sum of KL divergences:

\[ \begin{aligned} \mathbb{E}_q \bigg[ \underbrace{\kl{q(\bx_T|\bx_0)}{p(\bx_T)}}_{L_T \, \approx \, 0} + \sum_{t > 1} \underbrace{\kl{q(\bx_{t-1}|\bx_t,\bx_0)}{p_\theta(\bx_{t-1}|\bx_t)}}_{L_{t-1}} \underbrace{-\log p_\theta(\bx_0|\bx_1)}_{L_0, \, \text{ignore}} \bigg] \end{aligned} \]
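The algebra behind this rewrite is not spelled out here. A sketch of the missing step, which uses only Bayes' rule and the Markov property of the forward process, is:

\[ \begin{aligned} q(\bx_t|\bx_{t-1}) &= q(\bx_t|\bx_{t-1},\bx_0) = \frac{q(\bx_{t-1}|\bx_t,\bx_0)\, q(\bx_t|\bx_0)}{q(\bx_{t-1}|\bx_0)} \qquad (t > 1), \\ \sum_{t > 1} \log \frac{q(\bx_t|\bx_0)}{q(\bx_{t-1}|\bx_0)} &= \log \frac{q(\bx_T|\bx_0)}{q(\bx_1|\bx_0)} \end{aligned} \]

Substituting the first line into every \(t > 1\) term of \(L\) leaves a telescoping sum. Its \(\log q(\bx_T|\bx_0)\) part combines with \(-\log p(\bx_T)\) to form \(L_T\), and its \(-\log q(\bx_1|\bx_0)\) part cancels against the \(t = 1\) term, which leaves \(-\log p_\theta(\bx_0|\bx_1)\) as \(L_0\).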

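Writing the loss as \(L = L_T + \sum_{t \gt 1}L_{t-1} + L_0\), each \(L_{t-1}\) compares \(p_\theta\) with the forward process posterior, which is tractable when conditioned on \(\bx_0\):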

\[ \begin{aligned} q(\bx_{t-1}|\bx_t,\bx_0) &= \mathcal{N}(\bx_{t-1}; \tilde\bmu_t(\bx_t, \bx_0), \tilde\beta_t \bI), \\ \text{where}\quad \tilde\bmu_t(\bx_t, \bx_0) &\defeq \frac{\sqrt{\bar\alpha_{t-1}}\beta_t }{1-\bar\alpha_t}\bx_0 + \frac{\sqrt{\alpha_t}(1- \bar\alpha_{t-1})}{1-\bar\alpha_t} \bx_t \quad \text{and} \quad \tilde\beta_t \defeq \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t \end{aligned} \]
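To get a feel for the posterior variance \(\tilde\beta_t\), the sketch below evaluates it for every timestep and compares it with \(\beta_t\). The convention \(\bar\alpha_0 = 1\) and the linear schedule from \(10^{-4}\) to \(0.02\) with \(T = 1000\) are assumptions made for the illustration.

```python
def posterior_variances(betas):
    """Return [beta_tilde_1, ..., beta_tilde_T], assuming alpha_bar_0 = 1."""
    variances, alpha_bar_prev = [], 1.0
    for beta in betas:
        alpha_bar = alpha_bar_prev * (1.0 - beta)
        variances.append((1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * beta)
        alpha_bar_prev = alpha_bar
    return variances


T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
beta_tildes = posterior_variances(betas)
ratios = [bt / b for bt, b in zip(beta_tildes, betas)]  # beta_tilde_t / beta_t
```

Under these assumptions \(\tilde\beta_1\) is exactly zero, and the ratio \(\tilde\beta_t / \beta_t\) approaches 1 as \(t\) grows, so the two variances differ mainly at small \(t\).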

We parameterize the neural network so that \(p_\theta\) closely matches the forward process posterior in \(L_{t-1}\).

Recall that \(p_\theta(\bx_{t-1}|\bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))\) for \({1 \lt t \leq T}\).

The variance is fixed to \(\bSigma_\theta(\bx_t, t) = \sigma_t^2\bI\) rather than learned. Experimentally, both \(\sigma_t^2 = \beta_t\) and \(\sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t\) had similar results.

With \(p_\theta(\bx_{t-1} | \bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \sigma_t^2\bI)\), we can write:

\[ \begin{aligned} L_{t-1} &= \mathbb{E}_q \bigg[ \frac{1}{2\sigma_t^2} \|\tilde\bmu_t(\bx_t,\bx_0) - \bmu_\theta(\bx_t, t)\|^2 \bigg] + C \\ \tilde\bmu_t(\bx_t,\bx_0) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bx_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\bepsilon\bigg), \qquad \bx_t = \sqrt{\bar\alpha_t}\bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon \\ \bmu_\theta(\bx_t,t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bx_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\bepsilon_\theta(\bx_t,t)\bigg) \end{aligned} \]
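The posterior mean can be written either in terms of \(x_0\) and \(x_t\) or in terms of \(x_t\) and the noise \(\epsilon\). A quick numerical check on a single scalar value, assuming a toy schedule, suggests the two forms agree:

```python
import math
import random

rng = random.Random(0)
betas = [0.1, 0.2, 0.3]  # toy schedule for illustration
t = 3
beta_t = betas[t - 1]
alpha_t = 1.0 - beta_t
alpha_bar_prev = (1.0 - betas[0]) * (1.0 - betas[1])
alpha_bar_t = alpha_bar_prev * alpha_t

x_0 = rng.uniform(-1.0, 1.0)
eps = rng.gauss(0.0, 1.0)
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

# Posterior mean written in terms of x_0 and x_t.
mean_from_x0 = (
    math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t) * x_0
    + math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * x_t
)
# The same mean written in terms of x_t and the noise eps.
mean_from_eps = (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)
```

Because the second form depends on \(x_0\) only through the noise, a network that predicts the noise from \(x_t\) is enough to predict the mean.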

Input image data is assumed to consist of integers in \(\{0, 1, \ldots, 255\}\) scaled linearly to \([-1, 1]\). The last step of the reverse process is set to an independent discrete decoder. At the final step of sampling, no noise is added.
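The linear scaling mentioned above can be sketched as follows. The helper names are hypothetical, and the clamping and rounding on the way back to pixel values are assumptions about how samples would be converted for display.

```python
def to_model_range(pixels):
    """Map integer pixel values in {0, ..., 255} linearly onto [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def to_pixels(values):
    """Invert the scaling, clamping and rounding back to {0, ..., 255}."""
    return [min(255, max(0, round((v + 1.0) * 127.5))) for v in values]


scaled = to_model_range([0, 128, 255])  # endpoints map to -1.0 and 1.0
restored = to_pixels(scaled)            # round trip back to the original pixels
```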

Substituting this parameterization of \(\bmu_\theta\) into \(L_{t-1}\) simplifies the loss to

\[ \begin{aligned} \E_{\bx_0, \bepsilon}\bigg[ \underbrace{\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)}}_{\lambda_t} \left\| \bepsilon - \bepsilon_\theta(\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon, t) \right\|^2 \bigg] \end{aligned} \]
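The size of the weight \(\lambda_t\) in the loss above can be tabulated directly. The sketch assumes \(\sigma_t^2 = \beta_t\) and a linear schedule from \(10^{-4}\) to \(0.02\) with \(T = 1000\); `loss_weights` is a hypothetical helper name.

```python
def loss_weights(betas):
    """lambda_t = beta_t^2 / (2 * sigma_t^2 * alpha_t * (1 - alpha_bar_t))."""
    weights, alpha_bar = [], 1.0
    for beta in betas:
        alpha = 1.0 - beta
        alpha_bar *= alpha
        sigma_sq = beta  # assumed variance choice, sigma_t^2 = beta_t
        weights.append(beta ** 2 / (2.0 * sigma_sq * alpha * (1.0 - alpha_bar)))
    return weights


T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
weights = loss_weights(betas)
ratio = weights[0] / weights[-1]  # how much more the first step is weighted
```

Under these assumptions the first weight is close to 0.5 and the last is close to 0.01, so the smallest timesteps receive far more weight than the largest ones.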

For small \(t\), the weight \(\lambda_t\) is large, so those timesteps dominate the objective. In the paper, setting \(\lambda_t = 1\) improves sample quality and gives the simplified objective:

\[ \begin{aligned} L_\mathrm{simple} &\defeq \E_{t \sim \mathcal{U}(1, T), \bx_0, \bepsilon}\big[ \| \bepsilon - \bepsilon_\theta(\underbrace{\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon}_{\bx_t}, t) \|^2 \big] \\ \end{aligned} \]
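One Monte Carlo estimate of \(L_\mathrm{simple}\) for a single example might look like the sketch below. It uses a flat list of values, averages the squared error over coordinates (a mean instead of the squared norm, which differs only by a constant factor), and `predict_zero` is a hypothetical placeholder for \(\epsilon_\theta\).

```python
import math
import random


def simple_loss_step(predict_noise, x_0, alpha_bars, rng):
    """One Monte Carlo estimate of L_simple for a single flat example x_0."""
    t = rng.randint(1, len(alpha_bars))       # t ~ Uniform{1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x_0]  # eps ~ N(0, I)
    a_bar = alpha_bars[t - 1]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x_0, eps)]
    eps_hat = predict_noise(x_t, t)
    # Mean squared error between the true and the predicted noise.
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)


def predict_zero(x_t, t):
    """Hypothetical untrained model that always predicts zero noise."""
    return [0.0 for _ in x_t]


alpha_bars = [0.9, 0.72, 0.504]  # toy values of alpha_bar_t for T = 3
loss = simple_loss_step(predict_zero, [-1.0, 0.0, 1.0], alpha_bars, random.Random(0))
```

In a real training loop this quantity would be averaged over a batch and minimized with a gradient-based optimizer.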

  • forward_process – Forward Process, \(q(x_t|x_{t-1})\)

  • reverse_process – Reverse Denoising Process, \(p_\theta(x_{t-1}|x_t)\)

  • sample_gaussian – Samples from a Gaussian distribution using the reparameterization trick

  • linear_schedule – constants increasing linearly from \(10^{-4}\) to \(0.02\)

  • simple_loss – Simple Loss objective \(L_\text{simple}\), MSE loss between noise and predicted noise

  • DDPM – Training and Sampling for DDPM

  • UNet – U-Net for predicting noise in images

  • LitDDPM – LightningModule for training DDPM

DDPM Training and Sampling#

class dmme.ddpm.DDPM(model, timesteps)[source]#

Training and Sampling for DDPM

Parameters:
  • model (nn.Module) – model for estimating noise

  • timesteps (int) – total timesteps \(T\)

training_step(x_0)[source]#

Computes loss for DDPM

Parameters:

x_0 (torch.Tensor) – sample image to add noise and denoise for training

Returns:

loss, \(L_\text{simple}\)

Return type:

(torch.Tensor)

sampling_step(x_t, t)[source]#

Denoise image by sampling from \(p_\theta(x_{t-1}|x_t)\)

Parameters:
  • x_t (torch.Tensor) – image of shape \((N, C, H, W)\)

  • t (torch.Tensor) – starting \(t\) to sample from, a tensor of shape \((N,)\)

Returns:

denoised image of shape \((N, C, H, W)\)

Return type:

(torch.Tensor)

generate(img_size: Tuple[int, int, int, int])[source]#

Generate image of shape \((N, C, H, W)\) by running the full denoising steps

Parameters:

img_size (Tuple[int, int, int, int]) – image size to generate as a tuple \((N, C, H, W)\)

Returns:

generated image of shape \((N, C, H, W)\)

Return type:

(torch.Tensor)

forward(x, t)[source]#

Predicts noise given image and timestep

dmme.ddpm.forward_process(image, alpha_bar_t, noise)[source]#

Forward Process, \(q(x_t|x_{t-1})\)

Parameters:
  • image (torch.Tensor) – image of shape \((N, C, H, W)\)

  • alpha_bar_t (torch.Tensor) – \(\bar\alpha_t\) of shape \((N, 1, 1, *)\)

  • noise (torch.Tensor) – noise sampled from standard normal distribution with the same shape as the image

dmme.ddpm.reverse_process(x_t, beta_t, alpha_t, alpha_bar_t, noise_in_x_t, variance, noise)[source]#

Reverse Denoising Process, \(p_\theta(x_{t-1}|x_t)\)

Parameters:
  • x_t (torch.Tensor) – noisy image \(x_t\) of shape \((N, C, H, W)\)

  • beta_t (torch.Tensor) – \(\beta_t\) of shape \((N, 1, 1, *)\)

  • alpha_t (torch.Tensor) – \(\alpha_t\) of shape \((N, 1, 1, *)\)

  • alpha_bar_t (torch.Tensor) – \(\bar\alpha_t\) of shape \((N, 1, 1, *)\)

  • noise_in_x_t (torch.Tensor) – estimated noise in \(x_t\) predicted by a neural network

  • variance (torch.Tensor) – variance of the reverse process, either learned or fixed

  • noise (torch.Tensor) – noise sampled from \(\mathcal{N}(0, I)\)

dmme.ddpm.sample_gaussian(mean, variance, noise)[source]#

Samples from a Gaussian distribution using the reparameterization trick

Parameters:
  • mean (torch.Tensor) – mean of the distribution

  • variance (torch.Tensor) – variance of the distribution

  • noise (torch.Tensor) – noise sampled from \(\mathcal{N}(0, I)\)
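The reparameterization trick named here amounts to shifting and scaling standard normal noise. A minimal sketch on flat lists follows; the name `sample_gaussian_sketch` is hypothetical, and the library function presumably applies the same arithmetic to tensors.

```python
import math


def sample_gaussian_sketch(mean, variance, noise):
    """Return mean + sqrt(variance) * noise, elementwise over flat lists."""
    return [m + math.sqrt(v) * n for m, v, n in zip(mean, variance, noise)]


# 0 + 2 * 1 = 2.0 and 1 + 0.5 * (-2) = 0.0
sample = sample_gaussian_sketch([0.0, 1.0], [4.0, 0.25], [1.0, -2.0])
```

Passing the noise in as an argument keeps the function deterministic, which makes it easy to test and lets the caller set the noise to zero at the final sampling step.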

dmme.ddpm.linear_schedule(timesteps: int, start=0.0001, end=0.02) → Tensor[source]#

constants increasing linearly from \(10^{-4}\) to \(0.02\)

Parameters:
  • timesteps (int) – total timesteps

  • start (float) – starting value, defaults to 0.0001

  • end (float) – end value, defaults to 0.02
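A plain-Python sketch of such a schedule is shown below, together with a check that the default range diffuses the data almost completely by the last step. Whether the library includes both endpoints is an assumption, and `linear_schedule_sketch` is a hypothetical name.

```python
def linear_schedule_sketch(timesteps, start=0.0001, end=0.02):
    """Return `timesteps` values spaced evenly from `start` to `end`, inclusive."""
    if timesteps == 1:
        return [start]
    step = (end - start) / (timesteps - 1)
    return [start + step * i for i in range(timesteps)]


betas = linear_schedule_sketch(1000)

# Product of (1 - beta_t) over all steps, i.e. alpha_bar_T.
alpha_bar_T = 1.0
for beta in betas:
    alpha_bar_T *= 1.0 - beta
```

With these defaults \(\bar\alpha_T\) is far below \(10^{-3}\), so \(q(x_T|x_0)\) is very close to a standard normal distribution.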

dmme.ddpm.simple_loss(noise, estimated_noise)[source]#

Simple Loss objective \(L_\text{simple}\), MSE loss between noise and predicted noise

Parameters:
  • noise (torch.Tensor) – noise used in the forward process

  • estimated_noise (torch.Tensor) – estimated noise with the same shape as noise

U-Net for estimating noise in images#

  • UNet – U-Net for predicting noise in images

  • SinusoidalPositionEmbeddings – Transformer position encoding

  • ResBlock – 3x3 basic resblocks with group norm, dropout and timestep embeddings

  • DownSample – Downsample blocks

  • UpSample – Upsample blocks

  • Attention – Self Attention with groupnorm

class dmme.ddpm.UNet(in_channels, pos_dim=128, emb_dim=512, num_groups=32, dropout=0.1, channels_per_depth=(128, 256, 256, 256), num_blocks=2, attention_depths=(2,))[source]#

U-Net for predicting noise in images

Parameters:
  • in_channels (int) – input channels of image

  • pos_dim (int) – dimension of position embedding

  • emb_dim (int) – dimension of timestep embedding

  • num_groups (int) – number of groups in nn.GroupNorm

  • dropout (float) – dropout rate in nn.Dropout2d

  • channels_per_depth (Tuple[int, ...]) – channels per depth

  • num_blocks (int) – number of resblocks to use in each depth

  • attention_depths (Tuple[int, ...]) – depths to use attention blocks

forward(x, c)[source]#

Predicts noise from x

Parameters:
  • x (torch.Tensor) – image of shape \((N, C, H, W)\)

  • c (torch.Tensor) – timestep of shape \((N,)\)

Returns:

estimated noise in input image x

Return type:

(torch.Tensor)

class dmme.ddpm.SinusoidalPositionEmbeddings(dim)[source]#

Transformer position encoding

Parameters:

dim (int) – number of dimensions of the position embedding, \(d_\text{emb}\)

forward(t)[source]#
Parameters:

t (torch.Tensor) – timestep of shape \((N,)\)

Returns:

Positional Embedding of shape \((N, d_\text{emb})\)

Return type:

(torch.Tensor)
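For a single timestep, a Transformer-style sinusoidal encoding can be sketched as below. The exact layout (all sines followed by all cosines, frequencies spaced geometrically down to \(1/10000\)) is an assumption and may differ from the layout used by this class.

```python
import math


def sinusoidal_embedding(t, dim):
    """Embed an integer timestep t into a vector of even length `dim` (dim >= 4)."""
    half = dim // 2
    # Geometric progression of frequencies from 1 down to 1 / 10000.
    freqs = [math.exp(-math.log(10000.0) * i / (half - 1)) for i in range(half)]
    args = [t * f for f in freqs]
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]


emb = sinusoidal_embedding(t=10, dim=128)
```

Every component lies in \([-1, 1]\), and nearby timesteps get similar vectors, which gives the network a smooth signal for the current noise level.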

class dmme.ddpm.ResBlock(c_in, c_out, with_attention=False, emb_dim=512, num_groups=32, p=0.1)[source]#

3x3 basic resblocks with group norm, dropout and timestep embeddings

Parameters:
  • c_in (int) – number of input channels

  • c_out (int) – number of output channels

  • with_attention (bool) – whether to add attention block

  • emb_dim (int) – input timestep embedding dimension

  • num_groups (int) – number of groups in nn.GroupNorm

  • p (float) – dropout rate in nn.Dropout2d

forward(x, c)[source]#
Parameters:
  • x (torch.Tensor) – image of shape \((N, C_\text{in}, H, W)\)

  • c (torch.Tensor) – timestep embedding of shape \((N, d_\text{emb})\)

Returns:

feature map of shape \((N, C_\text{out}, H, W)\)

Return type:

(torch.Tensor)

dmme.ddpm.DownSample(c_in, c_out)[source]#

Downsample blocks

Parameters:
  • c_in (int) – number of input channels

  • c_out (int) – number of output channels

Returns:

down sampling layer using 2d convolutions

Return type:

(nn.Conv2d)

class dmme.ddpm.UpSample(c_in, c_out)[source]#

Upsample blocks

Parameters:
  • c_in (int) – number of input channels

  • c_out (int) – number of output channels

forward(x)[source]#
Parameters:

x (torch.Tensor) – image of shape \((N, C_\text{in}, H, W)\)

Returns:

upsampled feature map of shape \((N, C_\text{out}, 2H, 2W)\)

Return type:

(torch.Tensor)

class dmme.ddpm.Attention(dim, num_groups)[source]#

Self Attention with groupnorm

Parameters:
  • dim (int) – equivalent to \(d_\text{model}\)

  • num_groups (int) – number of groups in nn.GroupNorm

forward(x)[source]#
Parameters:

x (torch.Tensor) – image of shape \((N, C_\text{in}, H, W)\)

Returns:

feature maps of shape \((N, C_\text{in}, H, W)\)

Return type:

(torch.Tensor)

Training Loop#

class dmme.ddpm.LitDDPM(model: Module, lr: float = 0.0002, warmup: int = 5000, imgsize: Tuple[int, int, int] = (3, 32, 32), timesteps: int = 1000, decay: float = 0.9999)[source]#

LightningModule for training DDPM

Parameters:
  • model (nn.Module) – neural network predicting noise \(\epsilon_\theta\)

  • lr (float) – learning rate, defaults to \(2 \times 10^{-4}\)

  • warmup (int) – linearly increases learning rate for warmup steps until lr is reached, defaults to 5000

  • imgsize (Tuple[int, int, int]) – image size in (C, H, W)

  • timesteps (int) – total timesteps for the forward and reverse process, \(T\)

  • decay (float) – EMA decay value
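The `warmup` and `decay` parameters correspond to two small update rules, sketched here in plain Python. The function names are hypothetical, and the constant learning rate after warmup is an assumption; the module's actual scheduler and EMA callback may differ in detail.

```python
def warmup_lr(step, base_lr=2e-4, warmup=5000):
    """Learning rate that rises linearly to `base_lr` over `warmup` steps, then stays flat."""
    return base_lr * min(1.0, step / warmup)


def ema_update(shadow, params, decay=0.9999):
    """Move each shadow weight a fraction (1 - decay) toward the current weight."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


lrs = [warmup_lr(s) for s in (0, 2500, 5000, 10000)]
shadow = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
```

A decay close to 1, such as the default 0.9999, makes the averaged weights change slowly, which smooths out the noise of individual optimizer steps.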

forward(x_t: Tensor, t: int)[source]#

Denoise image once using DDPM

Parameters:
  • x_t (torch.Tensor) – image of shape \((N, C, H, W)\)

  • t (int) – starting \(t\) to sample from

  • noise (torch.Tensor) – noise to use for sampling, if None samples new noise

Returns:

generated sample of shape \((N, C, H, W)\)

Return type:

(torch.Tensor)

training_step(batch, batch_idx)[source]#

Train model using \(L_\text{simple}\)

test_step(batch, batch_idx)[source]#

Generate samples for evaluation

generate(img_size)[source]#

Iteratively sample from \(p_\theta(x_{t-1}|x_t)\) to generate images

Parameters:

x_t (torch.Tensor) – \(x_T\) to start from

test_epoch_end(outputs)[source]#

Compute metrics and log at the end of the epoch

configure_optimizers()[source]#

Configure optimizers for training. Uses Adam with learning-rate warmup.

configure_callbacks()[source]#

Configure EMA callback, will override any other EMA callback