DDPM#

In physics and chemistry, the principle of microscopic reversibility states that

“the microscopic detailed dynamics of particles and fields is time-reversible because the microscopic equations of motion are symmetric with respect to inversion in time”

This means that if a data distribution is diffused to noise, a reverse process exists at the microscopic level.

This is because the equations that describe the dynamics are “symmetric with respect to inversion in time”.

Assuming this reverse process exists, the Denoising Diffusion Probabilistic Model generates data by gradually denoising data starting from Gaussian noise.

Since this principle holds for “microscopic detailed dynamics”, we design a Forward Diffusion process that gradually diffuses data to Gaussian noise.

In each step, we sample from a Gaussian distribution that perturbs the data. Formally, we define it as a Markov chain of Gaussians:

\[ \begin{aligned} q(\bx_{1:T} | \bx_0) &\defeq \prod_{t=1}^T q(\bx_t | \bx_{t-1} ), \qquad q(\bx_t|\bx_{t-1}) \defeq \mathcal{N}(\bx_t;\sqrt{1-\beta_t}\bx_{t-1},\beta_t \bI) \end{aligned} \]

Note that we can sample \(\bx_t\) at an arbitrary timestep \(t\) in closed form:

\[ \begin{aligned} \alpha_t &\defeq 1-\beta_t, \quad \bar\alpha_t \defeq \prod_{s=1}^t \alpha_s \\ q(\bx_t|\bx_0) &= \mathcal{N}(\bx_t; \sqrt{\bar\alpha_t}\bx_0, (1-\bar\alpha_t)\bI) \end{aligned} \]
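The closed form can be checked numerically. The following is a minimal sketch in plain Python, assuming the linear schedule described further down this page (\(T = 1000\), \(\beta_t\) from \(10^{-4}\) to \(0.02\)). It writes \(\bx_t\) as a multiple of \(\bx_0\) plus Gaussian noise and pushes the coefficient and the noise variance through the chain one step at a time. The helper names are illustrative and are not part of dmme.

```python
import math

def linear_betas(T=1000, start=1e-4, end=0.02):
    """beta_1..beta_T, evenly spaced (assumed schedule for this sketch)."""
    return [start + (end - start) * i / (T - 1) for i in range(T)]

def closed_form(betas):
    """Return (sqrt(alpha_bar_T), 1 - alpha_bar_T) from the product of alphas."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

def step_by_step(betas):
    """Propagate x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps through the chain.

    Writing x_t = m * x_0 + noise with variance v, each step maps
    m -> sqrt(1 - beta_t) * m  and  v -> (1 - beta_t) * v + beta_t.
    """
    m, v = 1.0, 0.0
    for beta in betas:
        m = math.sqrt(1.0 - beta) * m
        v = (1.0 - beta) * v + beta
    return m, v

betas = linear_betas()
mean_coef, var = closed_form(betas)   # from q(x_T | x_0) directly
m, v = step_by_step(betas)            # from T applications of q(x_t | x_{t-1})
```

Both routes give the same coefficient and variance up to rounding. Under this schedule the coefficient on \(\bx_0\) at \(t = T\) is below \(0.01\) and the variance is above \(0.9999\), which is why \(q(\bx_T|\bx_0)\) is close to a standard normal.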

If \(\beta_t\) is small enough, the reverse process also exists and has the same functional form as the forward process. We therefore model it as a Markov chain of Gaussians starting from \(p(\bx_T)=\mathcal{N}(\bx_T; \bzero, \bI)\):

\[ \begin{aligned} p_\theta(\bx_{0:T}) &\defeq p(\bx_T)\prod_{t=1}^T p_\theta(\bx_{t-1}|\bx_t), \qquad p_\theta(\bx_{t-1}|\bx_t) \defeq \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t)) \end{aligned} \]

To generate data, we sample \(\bx_T\) from the standard normal distribution and then iteratively sample from \(p_\theta(\bx_{t-1}|\bx_t)\) for \(t = T, \ldots, 1\).

For training, we optimize the variational lower bound objective used in variational autoencoders:

\[ \begin{aligned} \Ea{-\log p_\theta(\bx_0)} &\leq \Eb{q}{ - \log \frac{p_\theta(\bx_{0:T})}{q(\bx_{1:T} | \bx_0)}} \\ &= \mathbb{E}_q\bigg[ -\log p(\bx_T) - \sum_{t \geq 1} \log \frac{p_\theta(\bx_{t-1} | \bx_t)}{q(\bx_t|\bx_{t-1})} \bigg] \eqqcolon L \end{aligned} \]

We can rewrite the variational lower bound as

\[ \begin{aligned} \mathbb{E}_q \bigg[ \underbrace{\kl{q(\bx_T|\bx_0)}{p(\bx_T)}}_{L_T \, \approx \, 0} + \sum_{t > 1} \underbrace{\kl{q(\bx_{t-1}|\bx_t,\bx_0)}{p_\theta(\bx_{t-1}|\bx_t)}}_{L_{t-1}} \underbrace{-\log p_\theta(\bx_0|\bx_1)}_{L_0, \, \text{ignore}} \bigg] \end{aligned} \]

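Rewriting the loss as \(L = L_T + \sum_{t > 1}L_{t-1} + L_0\), each \(L_{t-1}\) term compares \(p_\theta(\bx_{t-1}|\bx_t)\) with the forward process posterior, which is tractable when conditioned on \(\bx_0\):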

\[ \begin{aligned} q(\bx_{t-1}|\bx_t,\bx_0) &= \mathcal{N}(\bx_{t-1}; \tilde\bmu_t(\bx_t, \bx_0), \tilde\beta_t \bI), \\ \text{where}\quad \tilde\bmu_t(\bx_t, \bx_0) &\defeq \frac{\sqrt{\bar\alpha_{t-1}}\beta_t }{1-\bar\alpha_t}\bx_0 + \frac{\sqrt{\alpha_t}(1- \bar\alpha_{t-1})}{1-\bar\alpha_t} \bx_t \quad \text{and} \quad \tilde\beta_t \defeq \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t \end{aligned} \]
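The posterior coefficients can be sanity-checked in the scalar case. The sketch below evaluates the closed-form expressions for \(\tilde\bmu_t\) and \(\tilde\beta_t\) and compares them with the result of multiplying the two Gaussians \(q(\bx_{t-1}|\bx_0)\) and \(q(\bx_t|\bx_{t-1})\) directly. The function names and the input values are arbitrary choices for illustration.

```python
import math

def posterior_params(x_t, x_0, beta_t, alpha_bar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the closed-form expressions."""
    alpha_t = 1.0 - beta_t
    alpha_bar_t = alpha_t * alpha_bar_prev
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t
    return coef_x0 * x_0 + coef_xt * x_t, beta_tilde

def posterior_by_bayes(x_t, x_0, beta_t, alpha_bar_prev):
    """Same posterior, obtained by combining two Gaussians in x_{t-1}.

    Prior:      x_{t-1} | x_0     ~ N(sqrt(alpha_bar_prev) * x_0, 1 - alpha_bar_prev)
    Likelihood: x_t     | x_{t-1} ~ N(sqrt(alpha_t) * x_{t-1},    beta_t)
    """
    alpha_t = 1.0 - beta_t
    prior_mean = math.sqrt(alpha_bar_prev) * x_0
    prior_var = 1.0 - alpha_bar_prev
    precision = 1.0 / prior_var + alpha_t / beta_t
    mean = (prior_mean / prior_var + math.sqrt(alpha_t) * x_t / beta_t) / precision
    return mean, 1.0 / precision

mean_a, var_a = posterior_params(x_t=0.3, x_0=-0.8, beta_t=0.01, alpha_bar_prev=0.5)
mean_b, var_b = posterior_by_bayes(x_t=0.3, x_0=-0.8, beta_t=0.01, alpha_bar_prev=0.5)
```

The two routes agree up to rounding, and \(\tilde\beta_t\) never exceeds \(\beta_t\) because \(1-\bar\alpha_{t-1} \leq 1-\bar\alpha_t\).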

We parameterize the neural network so that \(p_\theta\) closely matches the forward process posterior in \(L_{t-1}\).

Recall that \(p_\theta(\bx_{t-1}|\bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))\) for \({1 \lt t \leq T}\).

We fix the variance to \(\bSigma_\theta(\bx_t, t) = \sigma_t^2\bI\). Experimentally, both \(\sigma_t^2 = \beta_t\) and \(\sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t\) gave similar results.

With \(p_\theta(\bx_{t-1} | \bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \sigma_t^2\bI)\), we can write:

\[ \begin{aligned} L_{t-1} &= \mathbb{E}_q \bigg[ \frac{1}{2\sigma_t^2} \|\tilde\mu_t(x_t,x_0) - \mu_\theta(x_t, t)\|^2 \bigg] + C \\ \tilde\mu_t(x_t,x_0) &= \frac{1}{\sqrt{\alpha_t}}\bigg(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon\bigg), \quad x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon \\ \mu_\theta(x_t,t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\bigg) \end{aligned} \]

Input image data is assumed to be integers in \(\{0, 1, \ldots, 255\}\) scaled linearly to \([-1, 1]\). The last step of the reverse process is set to an independent discrete decoder. At the final step of sampling, noise is not used.

Then we can simplify the loss to

\[ \begin{aligned} \E_{\bx_0, \bepsilon}\bigg[ \underbrace{\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)}}_{\lambda_t} \left\| \bepsilon - \bepsilon_\theta(\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon, t) \right\|^2 \bigg] \end{aligned} \]
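The weight \(\lambda_t\) comes from subtracting the two means in \(L_{t-1}\). With \(\alpha_t = 1-\beta_t\), both means share the \(\bx_t\) term and differ only in the noise, so

\[ \begin{aligned} \tilde\bmu_t - \bmu_\theta &= \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}\big(\bepsilon_\theta(\bx_t, t) - \bepsilon\big) \\ \frac{1}{2\sigma_t^2}\|\tilde\bmu_t - \bmu_\theta\|^2 &= \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)} \|\bepsilon - \bepsilon_\theta(\bx_t, t)\|^2 \end{aligned} \]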

For small \(t\), \(\lambda_t\) is large relative to later steps. In the paper, setting \(\lambda_t = 1\) improves sample quality:
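To get a feel for the size of \(\lambda_t\), the sketch below evaluates it over the whole schedule. It assumes the linear schedule with \(T = 1000\) and the choice \(\sigma_t^2 = \beta_t\), under which \(\lambda_t = \beta_t / (2\alpha_t(1-\bar\alpha_t))\). The function name is illustrative.

```python
def loss_weights(T=1000, start=1e-4, end=0.02):
    """lambda_t = beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)) with sigma_t^2 = beta_t."""
    weights, alpha_bar = [], 1.0
    for i in range(T):
        beta = start + (end - start) * i / (T - 1)
        alpha = 1.0 - beta
        alpha_bar *= alpha
        # One factor of beta cancels against sigma_t^2 = beta_t.
        weights.append(beta / (2.0 * alpha * (1.0 - alpha_bar)))
    return weights

lam = loss_weights()
ratio = lam[0] / lam[-1]  # weight at t = 1 relative to t = T
```

Under these assumptions the weight at \(t = 1\) is the largest in the schedule and is tens of times larger than the weight at \(t = T\). Replacing \(\lambda_t\) with \(1\) removes that emphasis on the least noisy steps.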

\[ \begin{aligned} L_\mathrm{simple} &\defeq \E_{t \sim \mathcal{U}(1, T), \bx_0, \bepsilon}\big[ \| \bepsilon - \bepsilon_\theta(\underbrace{\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon}_{\bx_t}, t) \|^2 \big] \\ \end{aligned} \]
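A single Monte Carlo term of \(L_\mathrm{simple}\) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the dmme implementation: data and noise are flat lists of floats, `eps_model` is a hypothetical stand-in for the network, and the squared error is averaged over elements (as an MSE loss would be), which differs from the squared norm in the formula only by a constant factor.

```python
import math
import random

def simple_loss(x_0, t, eps, eps_model, alpha_bar):
    """Mean squared error between the true noise and the predicted noise at step t.

    x_0 and eps are equal-length lists of floats; alpha_bar[t - 1] holds alpha_bar_t.
    eps_model(x_t, t) stands in for the network and returns a list of floats.
    """
    a = alpha_bar[t - 1]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x_0, eps)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(eps)

# Toy setup: a short schedule and one "image" with four pixels in [-1, 1].
T = 10
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar, prod = [], 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bar.append(prod)

rng = random.Random(0)
x_0 = [-1.0, -0.5, 0.5, 1.0]
t = rng.randint(1, T)                      # t ~ U{1, ..., T}
eps = [rng.gauss(0.0, 1.0) for _ in x_0]   # eps ~ N(0, I)

def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return [0.0] * len(x_t)

def oracle_model(x_t, t):
    """Cheats by using the known x_0 to recover the exact noise."""
    a = alpha_bar[t - 1]
    return [(xt - math.sqrt(a) * x) / math.sqrt(1.0 - a) for xt, x in zip(x_t, x_0)]

loss_zero = simple_loss(x_0, t, eps, zero_model, alpha_bar)
loss_oracle = simple_loss(x_0, t, eps, oracle_model, alpha_bar)
```

The zero predictor scores the mean of \(\epsilon^2\), while the oracle scores zero up to rounding. A trained network falls between the two.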

dmme.DDPMSampler

Wrapper for computing forward and reverse processes, sampling data, and computing loss for DDPM

dmme.ddpm.linear_schedule

constants increasing linearly from \(10^{-4}\) to \(0.02\)

dmme.ddpm.UNet

UNet with GroupNorm and Attention, Predicts noise from \(x_t\) and \(t\)

dmme.LitDDPM

LightningModule for training DDPM

Sampler#

class dmme.DDPMSampler(model: Module, timesteps: int)[source]#

Wrapper for computing forward and reverse processes, sampling data, and computing loss for DDPM

Paper: https://arxiv.org/abs/2006.11239

Code: https://github.com/hojonathanho/diffusion

Parameters:
  • model (nn.Module) – model

  • timesteps (int) – diffusion timesteps

forward_process(x_0, t, noise=None)[source]#

Forward Diffusion Process

Samples \(x_t\) from \(q(x_t|x_0) = \mathcal{N}(x_t;\sqrt{\bar\alpha_t}\bold{x}_0,(1-\bar\alpha_t)\bold{I})\)

Computes \(\bold{x}_t = \sqrt{\bar\alpha_t}\bold{x}_0 + \sqrt{1-\bar\alpha_t}\epsilon\) with \(\epsilon \sim \mathcal{N}(\bold{0}, \bold{I})\)

Parameters:
  • x_0 (torch.Tensor) – data to add noise to

  • t (int) – \(t\) in \(x_t\)

  • noise (torch.Tensor, optional) – \(\epsilon\), noise used in the forward process

Returns:

\(\bold{x}_t \sim q(\bold{x}_t|\bold{x}_0)\)

Return type:

(torch.Tensor)

noise_schedule()[source]#

Noise Schedule for DDPM

DDPM sets \(T = 1000\) and linearly increases \(\beta_t\) from \(10^{-4}\) to \(0.02\)

Returns:

\(\beta_1, \, ... \, ,\beta_T\) as a tensor of shape \((T,)\)

Return type:

(torch.Tensor)

register_alphas(beta)[source]#

Caches \(\alpha_t\) used in the forward and reverse process

The \(\alpha_t\) values are constants, so we register them as buffers of the nn.Module

Parameters:

beta (torch.Tensor) – beta values to use to compute alphas, a tensor of shape \((T,)\)

reverse_process(x_t, t, noise=None)[source]#

Reverse Denoising Process

Samples \(\bold{x}_{t-1}\) from \(p_\theta(\bold{x}_{t-1}|\bold{x}_t) = \mathcal{N}(\bold{x}_{t-1};\mu_\theta(\bold{x}_t, t), \sigma_t^2\bold{I})\)

\[\begin{aligned} \bold\mu_\theta(\bold{x}_t, t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) \\ \sigma_t^2 &= \beta_t \end{aligned} \]

Computes \(\bold{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) +\sigma_t\epsilon\)

Parameters:
  • x_t (torch.Tensor) – x_t

  • t (int) – current timestep

  • noise (torch.Tensor) – noise
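As a scalar illustration of one reverse step, the sketch below takes the variance \(\sigma_t^2 = \beta_t\) from the derivation at the top of this page, so the added noise is scaled by \(\sqrt{\beta_t}\). The function `reverse_step` and its arguments are hypothetical and do not mirror the `reverse_process` signature, which calls the model internally. Here the predicted noise is passed in directly so the result can be compared with the posterior mean \(\tilde\mu_t(x_t, x_0)\).

```python
import math

def reverse_step(x_t, eps_pred, beta_t, alpha_bar_t, noise=0.0):
    """One scalar reverse step with sigma_t^2 = beta_t.

    x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t)
              + sqrt(beta_t) * noise
    """
    alpha_t = 1.0 - beta_t
    mean = (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_t)
    return mean + math.sqrt(beta_t) * noise

# Build x_t from a known x_0 and eps so that the "prediction" can be exact.
beta_t, alpha_bar_prev = 0.01, 0.5
alpha_bar_t = (1.0 - beta_t) * alpha_bar_prev
x_0, eps = -0.8, 1.3
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

# With the exact noise and no added noise, the step returns the mean only.
x_prev = reverse_step(x_t, eps_pred=eps, beta_t=beta_t, alpha_bar_t=alpha_bar_t)

# Posterior mean of q(x_{t-1} | x_t, x_0) for comparison.
posterior_mean = (
    math.sqrt(alpha_bar_prev) * beta_t * x_0
    + math.sqrt(1.0 - beta_t) * (1.0 - alpha_bar_prev) * x_t
) / (1.0 - alpha_bar_t)
```

When the noise prediction is exact, the mean of the reverse step coincides with the forward process posterior mean, which is the quantity \(L_{t-1}\) asks the network to match.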

compute_loss(x_0, t=None, noise=None)[source]#

Computes the loss

\(L_\text{simple} = \mathbb{E}_{\bold{x}_0\sim q(\bold{x}_0), \epsilon\sim\mathcal{N}(\bold{0},\bold{I}), t\sim\mathcal{U}(1,T)} \left[\|\epsilon-\epsilon_\theta(\bold{x}_t, t) \|^2\right]\)

Parameters:
  • x_0 (torch.Tensor) – \(x_0\)

  • t (int, optional) – sampled \(t\)

  • noise (torch.Tensor, optional) – sampled \(\epsilon\)

sample(x_shape, start=0, end=None, step=1, save_last=True, device=None)[source]#

Generate Samples

Iteratively samples from \(p_\theta(x_{t-1}|x_t)\)

x = gaussian()
for t in range(T, 0, -1):
    z = gaussian() if t > 1 else 0
    x = reverse_process(x, t, z)
return x

start, end, step parameters specify which steps to return

Parameters:
  • x_shape (Tuple[int, int, int]) – image shape

  • start (int) – start t

  • end (int) – end t

  • step (int) – step

  • save_last (bool) – whether to save last sample

  • device (torch.device) – device

Returns:

denoised samples

Return type:

(List[torch.Tensor])
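The loop in the docstring above can be made runnable for a single scalar. This is a sketch under several assumptions: it uses \(\sigma_t^2 = \beta_t\), a short schedule with \(T = 50\), and a toy dataset in which every data point equals a constant `c`, so that the exact noise predictor is known in closed form and no network is needed. It does not model the `start`, `end`, `step`, or `save_last` parameters, and all names are illustrative.

```python
import math
import random

def cumulative_alphas(betas):
    """alpha_bar_t for t = 1..T, stored at index t - 1."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def sample(eps_model, betas, rng):
    """Scalar ancestral sampling: draw x_T ~ N(0, 1), then denoise for t = T, ..., 1."""
    alpha_bar = cumulative_alphas(betas)
    x = rng.gauss(0.0, 1.0)
    for t in range(len(betas), 0, -1):
        beta, a_bar = betas[t - 1], alpha_bar[t - 1]
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the final step
        mean = (x - beta / math.sqrt(1.0 - a_bar) * eps_model(x, t)) / math.sqrt(1.0 - beta)
        x = mean + math.sqrt(beta) * z  # sigma_t^2 = beta_t
    return x

# Toy problem: every data point equals c, so the exact noise predictor is known.
T, c = 50, 0.25
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = cumulative_alphas(betas)

def exact_eps(x_t, t):
    """Noise that maps c to x_t under q(x_t | x_0 = c)."""
    a_bar = alpha_bar[t - 1]
    return (x_t - math.sqrt(a_bar) * c) / math.sqrt(1.0 - a_bar)

x_final = sample(exact_eps, betas, random.Random(0))
```

Because the toy predictor is exact and the last step adds no noise, the loop ends on `c` up to rounding. With a trained network in place of `exact_eps`, the output would instead be a sample from the learned distribution.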

forward(x, t)[source]#

Predicts the noise given \(x_t\) and \(t\)

Applies forward to the internal model

Expects \(x_t\) to have shape \((N, C, H, W)\)

Parameters:
  • x (torch.Tensor) – image

  • t (int) – \(t\) in \(\bold{x}_t\)

dmme.ddpm.linear_schedule(timesteps, start=0.0001, end=0.02)[source]#

constants increasing linearly from \(10^{-4}\) to \(0.02\)

Parameters:
  • timesteps (int) – total timesteps

  • start (float) – starting value, defaults to 0.0001

  • end (float) – end value, defaults to 0.02
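For reference, a linear schedule with these defaults can be sketched in plain Python as below. This is a stand-in written for illustration, not the dmme source. It assumes both endpoints are included and returns a list rather than a tensor.

```python
def linear_schedule(timesteps, start=0.0001, end=0.02):
    """Evenly spaced beta_1..beta_T from start to end, both endpoints included."""
    if timesteps < 2:
        return [start] * timesteps
    step = (end - start) / (timesteps - 1)
    return [start + step * i for i in range(timesteps)]

betas = linear_schedule(1000)
```

With the defaults, `betas[0]` is \(10^{-4}\) and `betas[-1]` is \(0.02\), matching the schedule used by DDPM with \(T = 1000\).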

Model#

class dmme.ddpm.UNet(in_channels=3, dim=128, pos_dim=128, emb_dim=512, multipliers=(1, 2, 2, 2), attn_depth=(2,), groups=32, dropout=0.1)[source]#

UNet with GroupNorm and Attention, Predicts noise from \(x_t\) and \(t\)

Parameters:
  • in_channels (int) – input image channels

  • dim (int) – initial dim

  • pos_dim (int) – sinusoidal position encoding dim

  • emb_dim (int) – time embedding mlp dim

  • multipliers (Tuple[int, ...]) – list of channel multipliers

  • attn_depth (Tuple[int, ...]) – depth where attention is applied

  • groups (int) – number of groups in nn.GroupNorm

  • dropout (float) – dropout in ResBlock

forward(x, t)[source]#

Using timestep embeddings, predict noise to denoise \(x_t\) from \(x_t\) and \(t\) using a UNet

Parameters:
  • x (torch.Tensor) – \(x_t\), tensor of shape \((N, C, H, W)\)

  • t (int) – \(t\)

Returns:

\(\epsilon_\theta(x_t,t)\) predicted noise from image, a tensor of shape \((N, C, H, W)\)

Return type:

(torch.Tensor)

TimeStepEmbedding([pos_dim, emb_dim])

Timestep embedding network

SinusoidalPositionEmbeddings(dim)

Transformer position embedding

Block(in_channels, out_channels, emb_dim, ...)

Convolutional Block with multiple resblocks

DownSample(dim)

Downsampling layer

UpSample(dim, scale_factor)

Upsampling layer

Attention(dim[, groups])

Multi Head Self Attention layer

ResBlock(in_channels, out_channels, emb_dim)

ResBlock for UNet

conv2d(in_channels, out_channels, ...[, ...])

convolution layer builder with normalization and activation

class dmme.ddpm.unet.TimeStepEmbedding(pos_dim=64, emb_dim=256)[source]#

Timestep embedding network

Parameters:
  • pos_dim (int) – sinusoidal position encoding dim

  • emb_dim (int) – time embedding mlp dim

forward(t)[source]#

Encode \(t\) into Sinusoidal Position Embedding then use mlps to create timestep embeddings

Parameters:

t (torch.Tensor) – timestep as a tensor of shape \((N, 1)\)

Returns:

embedding of shape \((N, 1)\)

Return type:

(torch.Tensor)

class dmme.ddpm.unet.SinusoidalPositionEmbeddings(dim)[source]#

Transformer position embedding

Parameters:

dim (int) – dim

forward(time)[source]#

Encode time \(t\) as a Sinusoidal Position Embedding

Parameters:

time (torch.Tensor) – \(t\), a tensor of shape \((N, 1)\)

Returns:

position embedding of shape \((N, T)\)

Return type:

(torch.Tensor)
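The Transformer-style encoding of a single timestep can be sketched as follows. The layout is an assumption: this version concatenates all sine terms followed by all cosine terms and uses a maximum period of 10000, whereas implementations differ in whether they interleave the two halves and in the exact frequency spacing. The function name and the `max_period` argument are illustrative and are not taken from dmme.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Embed a scalar timestep into `dim` values (dim assumed even).

    First half:  sin(t * w_i)
    Second half: cos(t * w_i)
    with w_i = max_period ** (-i / half) for i = 0..half-1.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]

emb = sinusoidal_embedding(t=10, dim=8)
```

Each timestep maps to a fixed vector whose low-index entries vary quickly with \(t\) and whose high-index entries vary slowly, giving the timestep MLP a smooth, multi-scale view of \(t\).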

class dmme.ddpm.unet.Block(in_channels, out_channels, emb_dim, groups, dropout, num_blocks=2, add_attention=False)[source]#

Convolutional Block with multiple resblocks

Parameters:
  • in_channels (int) – number of input channels

  • out_channels (int) – number of output channels

  • emb_dim (int) – time embedding dim

  • groups (int) – num groups in nn.GroupNorm

  • dropout (float) – dropout used in ResBlock

  • num_blocks (int) – number of ResBlocks used

  • add_attention (bool) – whether to add attention to the final layer

forward(x, t)[source]#

Apply multiple ResBlocks with optional Attention at the end

Parameters:
  • x (torch.Tensor) – \(x_t\), tensor of shape \((N, C, H, W)\)

  • t (int) – \(t\)

Returns:

tensor of shape \((N, C, H, W)\) where \(C\) is out_channels

Return type:

x (torch.Tensor)

class dmme.ddpm.unet.DownSample(dim)[source]#

Downsampling layer

Parameters:

dim (int) – number of input and output channels

forward(x)[source]#

Downsample by a factor of 2 using convolutions

Returns:

x

class dmme.ddpm.unet.UpSample(dim, scale_factor)[source]#

Upsampling layer

Parameters:
  • dim (int) – number of input and output channels

  • scale_factor (float) – upsample scale

forward(x)[source]#

Upsample by an arbitrary factor by upsampling with interpolation followed by 3x3 convolutions with same input and output channels

Returns:

x

class dmme.ddpm.unet.Attention(dim, groups=8)[source]#

Multi Head Self Attention layer

Parameters:
  • dim (int) – \(d_\text{model}\)

  • groups (int) – num groups in nn.GroupNorm

forward(x)[source]#

Multi Head Self Attention on images with prenorm and residual connections

Returns:

x

class dmme.ddpm.unet.ResBlock(in_channels, out_channels, emb_dim, groups=8, dropout=0.0)[source]#

ResBlock for UNet

Parameters:
  • in_channels (int) – number of input channels

  • out_channels (int) – number of output channels

  • emb_dim (int) – timestep embedding dim

  • groups (int) – num groups in nn.GroupNorm

  • dropout (float) – dropout applied in each conv

forward(x, t)[source]#

ResBlock with time embeddings

ResBlock with two convolution layers and a residual connection. The time embedding is added to the first layer's output, using an MLP to match dimensions. Normalization, activation, and dropout are then applied in that order. The second convolutional layer is identical to that of the basic ResBlock.

Parameters:
  • x (torch.Tensor) – \(x_t\), tensor of shape \((N, C, H, W)\)

  • t (int) – \(t\)

Returns:

tensor of shape \((N, C, H, W)\) where \(C\) is out_channels

Return type:

x (torch.Tensor)

property norm#

Returns copies of normalization layers

property act#

Returns copies of activation layers

dmme.ddpm.unet.conv2d(in_channels, out_channels, kernel_size, padding, norm=None, act=None)[source]#

convolution layer builder with normalization and activation

Parameters:
  • in_channels (int) – number of input channels

  • out_channels (int) – number of output channels

  • kernel_size (int) – kernel size

  • padding (int) – padding

  • norm (nn.Module) – normalization layer instance

  • act (nn.Module) – activation function instance

Training#

class dmme.LitDDPM(sampler: Module, lr: float = 0.0002, warmup: int = 5000, imgsize: Tuple[int, int, int] = (3, 32, 32), timesteps: int = 1000, decay: float = 0.9999)[source]#

LightningModule for training DDPM

Parameters:
  • sampler (nn.Module) – an instance of DDPMSampler

  • lr (float) – learning rate, defaults to \(2 \times 10^{-4}\)

  • warmup (int) – linearly increases learning rate for warmup steps until lr is reached, defaults to 5000

  • imgsize (Tuple[int, int, int]) – image size in (C, H, W)

  • timesteps (int) – total timesteps for the forward and reverse process, \(T\)

  • decay (float) – EMA decay value
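The `warmup` and `decay` parameters can be illustrated with two small functions. Both are sketches under assumptions rather than the LitDDPM implementation: the warmup is taken to be a linear ramp that reaches `lr` after `warmup` steps and stays constant afterwards, and the EMA is the standard exponential moving average of the weights. The function names are hypothetical.

```python
def warmup_lr(step, lr=2e-4, warmup=5000):
    """Learning rate at optimizer step `step` (0-based): linear ramp to `lr`, then constant."""
    return lr * min(1.0, (step + 1) / warmup)

def ema_update(ema_weights, weights, decay=0.9999):
    """Exponential moving average: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

lrs = [warmup_lr(s) for s in (0, 2499, 4999, 10000)]
ema = ema_update([1.0, 0.0], [0.0, 1.0])
```

With a decay of \(0.9999\), each update moves the averaged weights only \(0.01\%\) of the way toward the current weights, so the averaged model changes slowly and smooths out step-to-step noise in training.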

training_step(batch, batch_idx)[source]#

Compute loss using sampler

training_epoch_end(outputs)[source]#

Generate samples at the end of the epoch

test_step(batch, batch_idx)[source]#

Generate samples for evaluation

test_epoch_end(outputs)[source]#

Compute metrics and log at the end of the epoch

configure_optimizers()[source]#

Configure optimizers for training. Uses Adam with learning rate warmup

setup(stage: str)[source]#

Prepare metrics for test stage

configure_callbacks()[source]#

Configure EMA callback, will override any other EMA callback

sample_and_log(num_samples=1, length=1)[source]#

Sample data and log to logger

Parameters:
  • num_samples (int) – number of samples

  • length (int) – length of history to save in \(T\) timesteps