DDPM#

In physics and chemistry, the principle of microscopic reversibility states that

“the microscopic detailed dynamics of particles and fields is time-reversible because the microscopic equations of motion are symmetric with respect to inversion in time”

This means that if a data distribution is diffused to noise, the reverse process exists at a microscopic level.

This is because the equations that describe the dynamics are “symmetric with respect to inversion in time”.

Assuming this reverse process exists, the Denoising Diffusion Probabilistic Model generates data by gradually denoising data starting from Gaussian noise.

Since this principle holds for “microscopic detailed dynamics”, we design a Forward Diffusion process that gradually diffuses data to Gaussian noise.

In each step, we sample from a Gaussian distribution that perturbs the data. Formally, we define it as a Markov chain of Gaussians:

\[ \begin{aligned} q(\bx_{1:T} | \bx_0) &\defeq \prod_{t=1}^T q(\bx_t | \bx_{t-1} ), \qquad q(\bx_t|\bx_{t-1}) \defeq \mathcal{N}(\bx_t;\sqrt{1-\beta_t}\bx_{t-1},\beta_t \bI) \end{aligned} \]

Note that we can sample $\bx_t$ for an arbitrary timestep $t$ in closed form:

\[ \begin{aligned} \alpha_t &\defeq 1-\beta_t, \quad \bar\alpha_t \defeq \prod_{s=1}^t \alpha_s \\ q(\bx_t|\bx_0) &= \mathcal{N}(\bx_t; \sqrt{\bar\alpha_t}\bx_0, (1-\bar\alpha_t)\bI) \end{aligned} \]
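
As a quick numerical check of the closed form, the sketch below builds a linear $\beta_t$ schedule and confirms that composing the per-step Gaussians reproduces the coefficient $\sqrt{\bar\alpha_t}$ and the variance $1-\bar\alpha_t$. It uses plain Python on scalars rather than PyTorch tensors, and the helper names (`linear_betas`, `alpha_bars`, `q_sample`) are illustrative assumptions, not part of `dmme`.

```python
import math
import random


def linear_betas(timesteps, start=1e-4, end=0.02):
    """beta_1..beta_T increasing linearly from start to end (hypothetical helper)."""
    if timesteps == 1:
        return [start]
    step = (end - start) / (timesteps - 1)
    return [start + i * step for i in range(timesteps)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(x_0, t, abar, noise=None):
    """Draw x_t ~ q(x_t | x_0) for a list of floats x_0 and a 1-indexed t."""
    if noise is None:
        noise = [random.gauss(0.0, 1.0) for _ in x_0]
    a = abar[t - 1]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x_0, noise)]


betas = linear_betas(1000)
abar = alpha_bars(betas)

# Compose the chain step by step: x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps.
# The coefficient on x_0 and the accumulated variance follow simple recursions.
mean_coef, var = 1.0, 0.0
for beta in betas:
    mean_coef *= math.sqrt(1.0 - beta)
    var = (1.0 - beta) * var + beta

print(mean_coef, math.sqrt(abar[-1]))  # both are the coefficient on x_0 at t = T
print(var, 1.0 - abar[-1])             # both are the variance at t = T
```

Because the two recursions agree with $\bar\alpha_T$, a single call to `q_sample` can replace $t$ sequential noising steps during training.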

If $\beta_t$ is small enough, the reverse process should also exist. And since the process is symmetric it should also be a Markov chain of Gaussians starting from $p(\bx_T)=\mathcal{N}(\bx_T; \bzero, \bI)$:

\[ \begin{aligned} p_\theta(\bx_{0:T}) &\defeq p(\bx_T)\prod_{t=1}^T p_\theta(\bx_{t-1}|\bx_t), \qquad p_\theta(\bx_{t-1}|\bx_t) \defeq \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t)) \end{aligned} \]

To generate data, we sample $\bx_T$ from the standard normal distribution and then iteratively sample $\bx_{t-1} \sim p_\theta(\bx_{t-1}|\bx_t)$ for $t = T, \dots, 1$.

For training, we optimize the variational lower bound objective used for variational autoencoders.

\[ \begin{aligned} \Ea{-\log p_\theta(\bx_0)} &\leq \Eb{q}{ - \log \frac{p_\theta(\bx_{0:T})}{q(\bx_{1:T} | \bx_0)}} \\ &= \mathbb{E}_q\bigg[ -\log p(\bx_T) - \sum_{t \geq 1} \log \frac{p_\theta(\bx_{t-1} | \bx_t)}{q(\bx_t|\bx_{t-1})} \bigg] \eqqcolon L \end{aligned} \]

We can rewrite the variational lower bound as

\[ \begin{aligned} \mathbb{E}_q \bigg[ \underbrace{\kl{q(\bx_T|\bx_0)}{p(\bx_T)}}_{L_T \, \approx \, 0} + \sum_{t > 1} \underbrace{\kl{q(\bx_{t-1}|\bx_t,\bx_0)}{p_\theta(\bx_{t-1}|\bx_t)}}_{L_{t-1}} \underbrace{-\log p_\theta(\bx_0|\bx_1)}_{L_0, \, \text{ignore}} \bigg] \end{aligned} \]

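We write the loss as $L = L_T + \sum_{t > 1} L_{t-1} + L_0$. The forward process posterior that appears in $L_{t-1}$ is tractable when conditioned on $\bx_0$: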

\[ \begin{aligned} q(\bx_{t-1}|\bx_t,\bx_0) &= \mathcal{N}(\bx_{t-1}; \tilde\bmu_t(\bx_t, \bx_0), \tilde\beta_t \bI), \\ \text{where}\quad \tilde\bmu_t(\bx_t, \bx_0) &\defeq \frac{\sqrt{\bar\alpha_{t-1}}\beta_t }{1-\bar\alpha_t}\bx_0 + \frac{\sqrt{\alpha_t}(1- \bar\alpha_{t-1})}{1-\bar\alpha_t} \bx_t \quad \text{and} \quad \tilde\beta_t \defeq \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t \end{aligned} \]

We parameterize the neural network so that $p_\theta$ closely matches the forward process posterior in $L_{t-1}$.

Recall that $p_\theta(\bx_{t-1}|\bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))$ for ${1 \lt t \leq T}$.

We fix the variance to a time-dependent constant, $\bSigma_\theta(\bx_t, t) = \sigma_t^2\bI$. Experimentally, both $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$ had similar results.

With $p_\theta(\bx_{t-1} | \bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \sigma_t^2\bI)$ and $x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon$, we can write:

\[ \begin{aligned} L_{t-1} &= \mathbb{E}_q \bigg[ \frac{1}{2\sigma_t^2} \|\tilde\mu_t(x_t,x_0) - \mu_\theta(x_t, t)\|^2 \bigg] + C \\ \tilde\mu_t(x_t,x_0) &= \frac{1}{\sqrt{\alpha_t}}\bigg(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon\bigg) \\ \mu_\theta(x_t,t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\bigg) \end{aligned} \]
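
The step from the posterior mean $\tilde\mu_t(x_t, x_0)$ to its noise-based form can be checked numerically. The sketch below is a scalar, plain-Python illustration under the linear schedule; the function names are hypothetical and not part of `dmme`. It evaluates the posterior mean from its definition and compares it with the expression written in terms of the noise $\epsilon$ that produced $x_t$.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
abar, running = [], 1.0
for a in alphas:
    running *= a
    abar.append(running)


def posterior_mean(x_t, x_0, t):
    """tilde-mu_t(x_t, x_0) from the forward-process posterior, for 1 < t <= T."""
    i = t - 1  # lists are 0-indexed, timesteps are 1-indexed
    c0 = math.sqrt(abar[i - 1]) * betas[i] / (1.0 - abar[i])
    ct = math.sqrt(alphas[i]) * (1.0 - abar[i - 1]) / (1.0 - abar[i])
    return c0 * x_0 + ct * x_t


def eps_form_mean(x_t, eps, t):
    """The same mean written in terms of the noise eps that produced x_t."""
    i = t - 1
    return (x_t - betas[i] / math.sqrt(1.0 - abar[i]) * eps) / math.sqrt(alphas[i])


def posterior_variance(t):
    """tilde-beta_t, the variance of q(x_{t-1} | x_t, x_0)."""
    i = t - 1
    return (1.0 - abar[i - 1]) / (1.0 - abar[i]) * betas[i]


x_0, eps, t = 0.3, -1.2, 500
x_t = math.sqrt(abar[t - 1]) * x_0 + math.sqrt(1.0 - abar[t - 1]) * eps

print(posterior_mean(x_t, x_0, t), eps_form_mean(x_t, eps, t))  # agree up to rounding
print(posterior_variance(t), betas[t - 1])  # tilde-beta_t is slightly below beta_t
```

Since both means share the same factor in front of $x_t$, their difference depends only on $\epsilon - \epsilon_\theta$, which is why the network can be trained to predict noise directly.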

Input image data is assumed to be integers in $\{0, 1, \ldots, 255\}$ scaled linearly to $[-1, 1]$. The last step of the reverse process is set to an independent discrete decoder. At the final step of sampling, noise is not used.
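
The linear scaling of pixel values can be sketched as follows. The function names are illustrative, and the inverse map is only a common convention for viewing samples (an assumption here), not the discrete decoder mentioned above.

```python
def to_model_range(pixels):
    """Map integer pixel values in {0, ..., 255} linearly to floats in [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def to_pixels(values):
    """Inverse map: clamp to [-1, 1], rescale, and round back to {0, ..., 255}."""
    return [int(round((min(max(v, -1.0), 1.0) + 1.0) * 127.5)) for v in values]


print(to_model_range([0, 128, 255]))
print(to_pixels([-1.0, 0.0, 1.0]))
```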

Substituting the two means into $L_{t-1}$ simplifies the loss to

\[ \begin{aligned} \E_{\bx_0, \bepsilon}\bigg[ \underbrace{\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)}}_{\lambda_t} \left\| \bepsilon - \bepsilon_\theta(\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon, t) \right\|^2 \bigg] \end{aligned} \]

For small $t$, $\lambda_t$ is too large. In the paper, setting $\lambda_t = 1$ improves sample quality:
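
To see how uneven the weights are, the sketch below evaluates $\lambda_t$ for a few timesteps. It assumes the linear schedule with $T = 1000$ and the variance choice $\sigma_t^2 = \beta_t$, and uses plain Python rather than the library code.

```python
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

lambdas, abar = [], 1.0
for beta in betas:
    alpha = 1.0 - beta
    abar *= alpha
    sigma_sq = beta  # the sigma_t^2 = beta_t choice
    lambdas.append(beta ** 2 / (2.0 * sigma_sq * alpha * (1.0 - abar)))

# Print lambda_t at a few timesteps to compare early and late weights.
for t in (1, 2, 10, 100, 1000):
    print(t, lambdas[t - 1])
```

Under these assumptions the weight at $t = 1$ is the largest in the schedule, so replacing $\lambda_t$ with 1 shifts emphasis away from the smallest noise levels.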

\[ \begin{aligned} L_\mathrm{simple} &\defeq \E_{t \sim \mathcal{U}(1, T), \bx_0, \bepsilon}\big[ \| \bepsilon - \bepsilon_\theta(\underbrace{\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon}_{\bx_t}, t) \|^2 \big] \\ \end{aligned} \]

`dmme.DDPMSampler`	Wrapper for computing forward and reverse processes, sampling data, and computing loss for DDPM
`dmme.ddpm.linear_schedule`	constants increasing linearly from $10^{-4}$ to $0.02$
`dmme.ddpm.UNet`	UNet with GroupNorm and Attention, Predicts noise from $x_t$ and $t$
`dmme.LitDDPM`	LightningModule for training DDPM

Sampler#

class dmme.DDPMSampler(model: Module, timesteps: int)[source]#

Wrapper for computing forward and reverse processes, sampling data, and computing loss for DDPM

Paper: https://arxiv.org/abs/2006.11239

Code: https://github.com/hojonathanho/diffusion

Parameters:

model (nn.Module) – model
timesteps (int) – diffusion timesteps

forward_process(x_0, t, noise=None)[source]#

Forward Diffusion Process

Samples $x_t$ from $q(x_t|x_0) = \mathcal{N}(x_t;\sqrt{\bar\alpha_t}\bold{x}_0,(1-\bar\alpha_t)\bold{I})$

Computes $\bold{x}_t = \sqrt{\bar\alpha_t}\bold{x}_0 + \sqrt{1-\bar\alpha_t}\epsilon$ with $\epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$

Parameters:

x_0 (torch.Tensor) – data to add noise to
t (int) – $t$ in $x_t$
noise (torch.Tensor, optional) – $\epsilon$, noise used in the forward process

Returns:

$\bold{x}_t \sim q(\bold{x}_t|\bold{x}_0)$

Return type:

(torch.Tensor)

noise_schedule()[source]#

Noise Schedule for DDPM

DDPM sets $T = 1000$ and linearly increases $\beta_t$ from $10^{-4}$ to $0.02$

Returns:: $\beta_1, \, ... \, ,\beta_T$ as a tensor of shape $(T,)$
Return type:: (torch.Tensor)

register_alphas(beta)[source]#

Caches $\alpha_t$ used in the forward and reverse process

The $\alpha_t$ values are constants, so they are registered as buffers of the nn.Module

Parameters:: beta (torch.Tensor) – beta values to use to compute alphas, a tensor of shape $(T,)$

reverse_process(x_t, t, noise=None)[source]#

Reverse Denoising Process

Samples $x_{t-1}$ from $p_\theta(\bold{x}_{t-1}|\bold{x}_t) = \mathcal{N}(\bold{x}_{t-1};\mu_\theta(\bold{x}_t, t), \sigma_t^2\bold{I})$

\[\begin{aligned} \bold\mu_\theta(\bold{x}_t, t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) \\ \sigma_t^2 &= \beta_t \end{aligned} \]

Computes $\bold{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) +\sigma_t\epsilon$

Parameters:

x_t (torch.Tensor) – x_t
t (int) – current timestep
noise (torch.Tensor) – noise

compute_loss(x_0, t=None, noise=None)[source]#

Computes the loss

$L_\text{simple} = \mathbb{E}_{\bold{x}_0\sim q(\bold{x}_0), \epsilon\sim\mathcal{N}(\bold{0},\bold{I}), t\sim\mathcal{U}(1,T)} \left[\|\epsilon-\epsilon_\theta(\bold{x}_t, t) \|^2\right]$

Parameters:

x_0 (torch.Tensor) – $x_0$
t (int, optional) – sampled $t$
noise (torch.Tensor, optional) – sampled $\epsilon$
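
A minimal sketch of how $L_\text{simple}$ can be estimated for one data point is shown below. It is plain Python on lists of floats, not the `compute_loss` implementation: the names `simple_loss`, `zero_model`, and `oracle_model` are hypothetical, and summing squared errors (rather than averaging them) is an assumption.

```python
import math
import random


def simple_loss(x_0, eps_model, abar, t=None, noise=None):
    """Single-sample Monte Carlo estimate of L_simple for one data point.

    x_0 is a list of floats, eps_model(x_t, t) returns a list of the same
    length, and abar[t - 1] holds alpha_bar_t. All names are illustrative.
    """
    if t is None:
        t = random.randint(1, len(abar))  # t ~ Uniform{1, ..., T}
    if noise is None:
        noise = [random.gauss(0.0, 1.0) for _ in x_0]
    a = abar[t - 1]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x_0, noise)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(noise, pred))


T = 1000
abar, running = [], 1.0
for i in range(T):
    running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    abar.append(running)

x_0 = [0.5, -0.25, 0.0]


def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return [0.0 for _ in x_t]


def oracle_model(x_t, t):
    """Cheats by using the known x_0 to recover the exact noise."""
    a = abar[t - 1]
    return [(xt - math.sqrt(a) * x) / math.sqrt(1.0 - a) for xt, x in zip(x_t, x_0)]


noise = [1.0, -2.0, 0.5]
print(simple_loss(x_0, zero_model, abar, t=300, noise=noise))    # squared norm of the noise
print(simple_loss(x_0, oracle_model, abar, t=300, noise=noise))  # close to zero
```

The two toy predictors bracket the behaviour of a trained network: predicting nothing costs the full squared norm of the noise, while a perfect predictor drives the loss to zero.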

sample(x_shape, start=0, end=None, step=1, save_last=True, device=None)[source]#

Generate Samples

Iteratively sample from $p_\theta(x_{t-1}|x_t)$

x = gaussian()
for t in range(T, 0, -1):
    z = gaussian() if t > 1 else 0
    x = reverse_process(x, t, z)
return x

start, end, step parameters specify which steps to return
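
The loop above can be made concrete with a toy noise predictor. The sketch below is plain Python on lists of floats and is not the library's `sample` method: `reverse_step`, `sample`, and `oracle_eps` are illustrative names, the variance choice $\sigma_t^2 = \beta_t$ is assumed, and the "model" is an oracle that already knows the data point it should recover.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    abar.append(running)


def reverse_step(x_t, t, eps_pred, z):
    """One draw x_{t-1} = mu_theta(x_t, t) + sigma_t * z with sigma_t^2 = beta_t."""
    beta, a = betas[t - 1], abar[t - 1]
    scale = beta / math.sqrt(1.0 - a)
    sigma = math.sqrt(beta)
    return [
        (x - scale * e) / math.sqrt(1.0 - beta) + sigma * n
        for x, e, n in zip(x_t, eps_pred, z)
    ]


def sample(eps_model, dim, rng):
    """Start from Gaussian noise and apply reverse_step for t = T, ..., 1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(T, 0, -1):
        # No noise is added at the final step (t = 1).
        z = [rng.gauss(0.0, 1.0) if t > 1 else 0.0 for _ in range(dim)]
        x = reverse_step(x, t, eps_model(x, t), z)
    return x


target = [0.8, -0.3]  # stand-in "data point" known to the toy model


def oracle_eps(x_t, t):
    """Toy noise predictor: the exact noise if x_t had been diffused from target."""
    a = abar[t - 1]
    return [(x - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for x, x0 in zip(x_t, target)]


print(sample(oracle_eps, dim=2, rng=random.Random(0)))  # matches target up to rounding
```

With this oracle the final, noise-free step returns the target, which shows why no noise is injected at $t = 1$.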

Parameters:

x_shape (Tuple[int, int, int]) – image shape
start (int) – start t
end (int) – end t
step (int) – step
save_last (bool) – whether to save last sample
device (torch.device) – device

Returns:

denoised samples

Return type:

(List[torch.Tensor])

forward(x, t)[source]#

Predicts the noise given $x_t$ and $t$

Applies forward to the internal model

Expects $x_t$ to have shape $(N, C, H, W)$

Parameters:

x (torch.Tensor) – image
t (int) – $t$ in $\bold{x}_t$

dmme.ddpm.linear_schedule(timesteps, start=0.0001, end=0.02)[source]#

constants increasing linearly from $10^{-4}$ to $0.02$

Parameters:

timesteps (int) – total timesteps
start (float) – starting value, defaults to 0.0001
end (float) – end value, defaults to 0.02

Model#

class dmme.ddpm.UNet(in_channels=3, dim=128, pos_dim=128, emb_dim=512, multipliers=(1, 2, 2, 2), attn_depth=(2,), groups=32, dropout=0.1)[source]#

UNet with GroupNorm and Attention, Predicts noise from $x_t$ and $t$

Parameters:

in_channels (int) – input image channels
dim (int) – initial dim
pos_dim (int) – sinusoidal position encoding dim
emb_dim (int) – time embedding mlp dim
multipliers (Tuple[int, ...]) – list of channel multipliers
attn_depth (Tuple[int, ...]) – depth where attention is applied
groups (int) – number of groups in nn.GroupNorm
dropout (float) – dropout in ResBlock

forward(x, t)[source]#

Predicts the noise in $x_t$ from $x_t$ and a timestep embedding of $t$ using a UNet

Parameters:

x (torch.Tensor) – $x_t$, tensor of shape $(N, C, H, W)$
t (int) – $t$

Returns:

$\epsilon_\theta(x_t,t)$ predicted noise from image, a tensor of shape $(N, C, H, W)$

Return type:

(torch.Tensor)

`TimeStepEmbedding`([pos_dim, emb_dim])	Timestep embedding network
`SinusoidalPositionEmbeddings`(dim)	Transformer position embedding
`Block`(in_channels, out_channels, emb_dim, ...)	Convolutional Block with multiple resblocks
`DownSample`(dim)	Downsampling layer
`UpSample`(dim, scale_factor)	Upsampling layer
`Attention`(dim[, groups])	Multi Head Self Attention layer
`ResBlock`(in_channels, out_channels, emb_dim)	ResBlock for UNet
`conv2d`(in_channels, out_channels, ...[, ...])	convolution layer builder with normalization and activation

class dmme.ddpm.unet.TimeStepEmbedding(pos_dim=64, emb_dim=256)[source]#

Timestep embedding network

Parameters:

pos_dim (int) – sinusoidal position encoding dim
emb_dim (int) – time embedding mlp dim

forward(t)[source]#

Encode $t$ as a sinusoidal position embedding, then apply MLP layers to create the timestep embedding

Parameters:: t (torch.Tensor) – timestep as a tensor of shape $(N, 1)$
Returns:: embedding of shape $(N, 1)$
Return type:: (torch.Tensor)

class dmme.ddpm.unet.SinusoidalPositionEmbeddings(dim)[source]#

Transformer position embedding

Parameters:: dim (int) – dim

forward(time)[source]#

Encode time $t$ as a Sinusoidal Position Embedding

Parameters:: time (torch.Tensor) – $t$, a tensor of shape $(N, 1)$
Returns:: position embedding of shape $(N, T)$
Return type:: (torch.Tensor)
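
For orientation, a sinusoidal embedding of a single timestep can be sketched in plain Python as below. This follows the standard Transformer formulation and is an assumption about the general technique, not a copy of this class: the function name, the `max_period` argument, and the layout (all sines followed by all cosines) are illustrative.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Embed a scalar timestep t as [sin(t * w_i)..., cos(t * w_i)...] of length dim.

    Assumes an even dim and the sin-then-cos layout; implementations differ on
    layout and on the exact frequency spacing.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]


emb = sinusoidal_embedding(t=10, dim=8)
print(len(emb))
print(sinusoidal_embedding(t=0, dim=4))
```

Each frequency gives the network a different resolution on $t$, so nearby timesteps get similar embeddings while distant ones remain distinguishable.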

class dmme.ddpm.unet.Block(in_channels, out_channels, emb_dim, groups, dropout, num_blocks=2, add_attention=False)[source]#

Convolutional Block with multiple resblocks

Parameters:

in_channels (int) – number of input channels
out_channels (int) – number of output channels
emb_dim (int) – time embedding dim
groups (int) – num groups in nn.GroupNorm
dropout (float) – dropout used in ResBlock
num_blocks (int) – number of ResBlocks used
add_attention (bool) – whether to add attention to the final layer

forward(x, t)[source]#

Apply multiple ResBlocks with optional Attention at the end

Parameters:

x (torch.Tensor) – $x_t$, tensor of shape $(N, C, H, W)$
t (int) – $t$

Returns:

tensor of shape $(N, C, H, W)$ where $C$ is out_channels

Return type:

x (torch.Tensor)

class dmme.ddpm.unet.DownSample(dim)[source]#

Downsampling layer

Parameters:: dim (int) – number of input and output channels

forward(x)[source]#

Downsample by a factor of 2 using convolutions

Returns:: x

class dmme.ddpm.unet.UpSample(dim, scale_factor)[source]#

Upsampling layer

Parameters:

dim (int) – number of input and output channels
scale_factor (float) – upsample scale

forward(x)[source]#

Upsample by an arbitrary factor by upsampling with interpolation followed by 3x3 convolutions with same input and output channels

Returns:: x

class dmme.ddpm.unet.Attention(dim, groups=8)[source]#

Multi Head Self Attention layer

Parameters:

dim (int) – $d_\text{model}$
groups (int) – num groups in nn.GroupNorm

forward(x)[source]#

Multi Head Self Attention on images with prenorm and residual connections

Returns:: x

class dmme.ddpm.unet.ResBlock(in_channels, out_channels, emb_dim, groups=8, dropout=0.0)[source]#

ResBlock for UNet

Parameters:

in_channels (int) – number of input channels
out_channels (int) – number of output channels
emb_dim (int) – timestep embedding dim
groups (int) – num groups in nn.GroupNorm
dropout (float) – dropout applied in each conv

forward(x, t)[source]#

ResBlock with time embeddings

ResBlock with two convolution layers and a residual connection. The time embedding is added to the first layer’s output, using an MLP to match dimensions. Normalization, activation, and dropout are then applied in that order. The second convolutional layer is identical to that of the basic ResBlock.

Parameters:

x (torch.Tensor) – $x_t$, tensor of shape $(N, C, H, W)$
t (int) – $t$

Returns:

tensor of shape $(N, C, H, W)$ where $C$ is out_channels

Return type:

x (torch.Tensor)

property norm#: Returns copies of normalization layers

property act#: Returns copies of activation layers

dmme.ddpm.unet.conv2d(in_channels, out_channels, kernel_size, padding, norm=None, act=None)[source]#

convolution layer builder with normalization and activation

Parameters:

in_channels (int) – number of input channels
out_channels (int) – number of output channels
kernel_size (int) – kernel size
padding (int) – padding
norm (nn.Module) – normalization layer instance
act (nn.Module) – activation function instance

Training#

class dmme.LitDDPM(sampler: Module, lr: float = 0.0002, warmup: int = 5000, imgsize: Tuple[int, int, int] = (3, 32, 32), timesteps: int = 1000, decay: float = 0.9999)[source]#

LightningModule for training DDPM

Parameters:

sampler (nn.Module) – an instance of DDPMSampler
lr (float) – learning rate, defaults to $2 \times 10^{-4}$
warmup (int) – linearly increases learning rate for warmup steps until lr is reached, defaults to 5000
imgsize (Tuple[int, int, int]) – image size in (C, H, W)
timesteps (int) – total timesteps for the forward and reverse process, $T$
decay (float) – EMA decay value
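
The `warmup` and `decay` parameters can be illustrated with the following sketch. It shows a generic linear warmup and a generic exponential moving average in plain Python; the exact formulas used by `LitDDPM` and its EMA callback are not given here, so both functions are assumptions with hypothetical names.

```python
def warmup_lr(step, base_lr=2e-4, warmup=5000):
    """Learning rate that ramps linearly from 0 to base_lr over `warmup` steps."""
    return base_lr * min(1.0, step / warmup)


def ema_update(ema_params, params, decay=0.9999):
    """Exponential moving average: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


print(warmup_lr(0), warmup_lr(2500), warmup_lr(5000), warmup_lr(100000))

ema = [0.0]
for _ in range(10000):
    ema = ema_update(ema, [1.0])
print(ema[0])  # approaches 1.0 slowly because decay is close to 1
```

A decay this close to 1 averages over roughly the last ten thousand updates, which smooths the weights used for sampling.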

training_step(batch, batch_idx)[source]#: Compute loss using sampler

training_epoch_end(outputs)[source]#: Generate samples at the end of the epoch

test_step(batch, batch_idx)[source]#: Generate samples for evaluation

test_epoch_end(outputs)[source]#: Compute metrics and log at the end of the epoch

configure_optimizers()[source]#: Configure optimizers for training. Uses Adam with learning rate warmup.

setup(stage: str)[source]#: Prepare metrics for test stage

configure_callbacks()[source]#: Configure EMA callback, will override any other EMA callback

sample_and_log(num_samples=1, length=1)[source]#

Sample data and log to logger

Parameters:

num_samples (int) – number of samples
length (int) – length of history to save in $T$ timesteps