DDPM#

In physics and chemistry, the principle of microscopic reversibility states that

“the microscopic detailed dynamics of particles and fields is time-reversible because the microscopic equations of motion are symmetric with respect to inversion in time”

This means that the diffusion of particles can be reversed at the microscopic level.

Assuming this principle also holds for images, we can train a neural network to learn the reverse (denoising) process, since it mirrors the forward diffusion of images into noise.

This is, in general terms, what Denoising Diffusion Probabilistic Models (DDPM) do: they generate data by gradually denoising a sample, starting from Gaussian noise.

Since this principle holds for “microscopic detailed dynamics”, the forward diffusion process is designed to diffuse data into Gaussian noise gradually, in many small steps.

In each step, we sample from a Gaussian distribution that perturbs the data. Formally, we define it as a Markov chain of Gaussians:

\[ \begin{aligned} q(\bx_{1:T} | \bx_0) &\defeq \prod_{t=1}^T q(\bx_t | \bx_{t-1} ), \qquad q(\bx_t|\bx_{t-1}) \defeq \mathcal{N}(\bx_t;\sqrt{1-\beta_t}\bx_{t-1},\beta_t \bI) \end{aligned} \]

Diffusion models scale down the data at each forward process step (by a factor of \(\sqrt{1-\beta_t}\)) so that the variance does not grow as noise is added, which keeps the inputs to the neural network reverse process consistently scaled.

Note that we can sample \(\bx_t\) for an arbitrary timestep \(t\) in closed form:

\[ \begin{aligned} \alpha_t &\defeq 1-\beta_t, \quad \bar\alpha_t \defeq \prod_{s=1}^t \alpha_s \\ q(\bx_t|\bx_0) &= \mathcal{N}(\bx_t; \sqrt{\bar\alpha_t}\bx_0, (1-\bar\alpha_t)\bI) \end{aligned} \]
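The closed form follows from composing the per-step Gaussians: each step multiplies the mean coefficient by \(\sqrt{1-\beta_t}\), and the noise variance obeys \(v_t = (1-\beta_t)v_{t-1} + \beta_t\). The sketch below checks this numerically in plain Python. The linear schedule and the function names are assumptions made for illustration and are not part of the `dmme` API.

```python
import math

def iterate_forward_moments(betas, t):
    """Apply q(x_s | x_{s-1}) for s = 1..t, tracking x_t = coef * x_0 + noise of variance var."""
    coef, var = 1.0, 0.0
    for beta in betas[:t]:
        coef *= math.sqrt(1.0 - beta)       # mean is scaled by sqrt(1 - beta_s)
        var = (1.0 - beta) * var + beta     # old noise shrinks, fresh noise is added
    return coef, var

def closed_form_moments(betas, t):
    """Mean coefficient sqrt(alpha_bar_t) and variance 1 - alpha_bar_t of q(x_t | x_0)."""
    alpha_bar = math.prod(1.0 - beta for beta in betas[:t])
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

# Assumed schedule: T = 1000 values spaced linearly from 1e-4 to 0.02.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

step_coef, step_var = iterate_forward_moments(betas, 500)
cf_coef, cf_var = closed_form_moments(betas, 500)
```

Both routes should agree up to floating-point error, which is why training can jump straight to any timestep without simulating the chain.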

\(\beta_t\) is chosen to be small relative to data scaled to \([-1, 1]\), which ensures that each step is microscopic, and \(T\) is chosen large enough that the data is almost completely diffused to Gaussian noise.
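As a rough check of how much signal survives at the final step, the sketch below evaluates \(\bar\alpha_T\) in plain Python. It assumes \(T = 1000\) and a linear schedule from \(10^{-4}\) to \(0.02\), the defaults documented for `linear_schedule` below; the variable names are illustrative only.

```python
import math

T = 1000
# Assumed linear schedule from 1e-4 to 0.02 (both endpoints included).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_bar_T = math.prod(1.0 - b for b in betas)
signal_scale = math.sqrt(alpha_bar_T)   # coefficient on x_0 in q(x_T | x_0)
noise_var = 1.0 - alpha_bar_T           # variance of q(x_T | x_0)
```

Under these assumptions the coefficient on \(\bx_0\) is close to zero and the variance is close to one, so \(q(\bx_T|\bx_0)\) is nearly indistinguishable from \(\mathcal{N}(\bzero, \bI)\).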

Since the forward and reverse processes are symmetric, the reverse denoising process should also be a Markov chain of Gaussians, starting from \(p(\bx_T)=\mathcal{N}(\bx_T; \bzero, \bI)\):

\[ \begin{aligned} p_\theta(\bx_{0:T}) &\defeq p(\bx_T)\prod_{t=1}^T p_\theta(\bx_{t-1}|\bx_t), \qquad p_\theta(\bx_{t-1}|\bx_t) \defeq \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t)) \end{aligned} \]

To generate data, we sample \(\bx_T\) from the standard normal distribution and then iteratively sample from \(p_\theta(\bx_{t-1}|\bx_t)\). The final denoising step acts as a discrete decoder and adds no noise.

For training, we optimize the variational lower bound objective used in variational autoencoders:

\[ \begin{aligned} \Ea{-\log p_\theta(\bx_0)} &\leq \Eb{q}{ - \log \frac{p_\theta(\bx_{0:T})}{q(\bx_{1:T} | \bx_0)}} \\ &= \mathbb{E}_q\bigg[ -\log p(\bx_T) - \sum_{t \geq 1} \log \frac{p_\theta(\bx_{t-1} | \bx_t)}{q(\bx_t|\bx_{t-1})} \bigg] \eqqcolon L \end{aligned} \]

We can rewrite the variational lower bound as

\[ \begin{aligned} \mathbb{E}_q \bigg[ \underbrace{\kl{q(\bx_T|\bx_0)}{p(\bx_T)}}_{L_T \, \approx \, 0} + \sum_{t > 1} \underbrace{\kl{q(\bx_{t-1}|\bx_t,\bx_0)}{p_\theta(\bx_{t-1}|\bx_t)}}_{L_{t-1}} \underbrace{-\log p_\theta(\bx_0|\bx_1)}_{L_0, \, \text{ignore}} \bigg] \end{aligned} \]

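Writing the loss as \(L = L_T + \sum_{t > 1} L_{t-1} + L_0\), each \(L_{t-1}\) compares \(p_\theta(\bx_{t-1}|\bx_t)\) against the forward process posterior, which is tractable when conditioned on \(\bx_0\):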

\[ \begin{aligned} q(\bx_{t-1}|\bx_t,\bx_0) &= \mathcal{N}(\bx_{t-1}; \tilde\bmu_t(\bx_t, \bx_0), \tilde\beta_t \bI), \\ \text{where}\quad \tilde\bmu_t(\bx_t, \bx_0) &\defeq \frac{\sqrt{\bar\alpha_{t-1}}\beta_t }{1-\bar\alpha_t}\bx_0 + \frac{\sqrt{\alpha_t}(1- \bar\alpha_{t-1})}{1-\bar\alpha_t} \bx_t \quad \text{and} \quad \tilde\beta_t \defeq \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t \end{aligned} \]
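The posterior coefficients can be computed directly from the schedule. The following is a minimal sketch in plain Python, assuming a linear schedule with \(T = 1000\) and lists padded so that index `t` corresponds to step \(t\) (with \(\bar\alpha_0 = 1\)); `posterior_params` is a hypothetical helper, not a `dmme` function.

```python
import math

def posterior_params(t, betas, alpha_bars):
    """Coefficients of x_0 and x_t in tilde mu_t, plus tilde beta_t.

    betas[t] is beta_t and alpha_bars[t] is bar alpha_t, with alpha_bars[0] = 1.
    """
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    abar_t, abar_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return coef_x0, coef_xt, beta_tilde

# Assumed linear schedule, padded with a dummy entry at index 0.
T = 1000
betas = [0.0] + [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = [1.0]
for beta in betas[1:]:
    alpha_bars.append(alpha_bars[-1] * (1.0 - beta))

coef_x0, coef_xt, beta_tilde = posterior_params(500, betas, alpha_bars)
```

Two sanity checks follow from the formulas: \(\tilde\beta_t < \beta_t\) for \(t > 1\), and at \(t = 1\) the posterior collapses onto \(\bx_0\) because \(\bar\alpha_0 = 1\) makes \(\tilde\beta_1 = 0\).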

We parameterize the neural network so that \(p_\theta(\bx_{t-1}|\bx_t)\) closely matches the forward process posterior in \(L_{t-1}\).

Recall that \(p_\theta(\bx_{t-1}|\bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))\) for \({1 \lt t \leq T}\).

Experimentally, both \(\sigma_t^2 = \beta_t\) and \(\sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t\) gave similar results, so the variance is fixed to \(\bSigma_\theta(\bx_t, t) = \sigma_t^2\bI\).

With \(p_\theta(\bx_{t-1} | \bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \sigma_t^2\bI)\), we can write:

\[ \begin{aligned} L_{t-1} &= \mathbb{E}_q \bigg[ \frac{1}{2\sigma_t^2} \|\tilde\bmu_t(\bx_t,\bx_0) - \bmu_\theta(\bx_t, t)\|^2 \bigg] + C \\ \tilde\bmu_t(\bx_t,\bx_0) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bx_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\bepsilon\bigg) \\ \bmu_\theta(\bx_t,t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bx_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\bepsilon_\theta(\bx_t,t)\bigg) \end{aligned} \]

Here \(C\) does not depend on \(\theta\), and \(\bepsilon \sim \mathcal{N}(\bzero, \bI)\) is the noise that produced \(\bx_t = \sqrt{\bar\alpha_t}\bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon\).
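The noise form of the posterior mean is the \(\bx_0\) form with \(\bx_0\) eliminated through the closed-form forward sample. The sketch below verifies the equivalence on scalars in plain Python; the numeric values are arbitrary illustrative choices and the function names are hypothetical.

```python
import math

def mu_from_x0(x_t, x_0, beta_t, abar_t, abar_prev):
    """Posterior mean written in terms of x_0 and x_t."""
    alpha_t = 1.0 - beta_t
    return (math.sqrt(abar_prev) * beta_t * x_0
            + math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t) / (1.0 - abar_t)

def mu_from_eps(x_t, eps, beta_t, abar_t):
    """Same mean written in terms of x_t and the noise eps that produced it."""
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(1.0 - beta_t)

# Illustrative numbers (assumptions): any beta_t in (0, 1) and abar_prev in (0, 1] work.
beta_t, abar_prev = 0.01, 0.6
abar_t = abar_prev * (1.0 - beta_t)
x_0, eps = 0.8, -1.3
x_t = math.sqrt(abar_t) * x_0 + math.sqrt(1.0 - abar_t) * eps

mu_a = mu_from_x0(x_t, x_0, beta_t, abar_t, abar_prev)
mu_b = mu_from_eps(x_t, eps, beta_t, abar_t)
```

The two values should agree up to rounding, which is what allows the network to predict the noise instead of the mean.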

Input image data is assumed to consist of integers in \(\{0, 1, \ldots, 255\}\) scaled linearly to \([-1, 1]\). The last step of the reverse process is set to an independent discrete decoder, and no noise is added at the final step of sampling.
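The sampling procedure can be sketched on scalars in plain Python. This is only an illustration under stated assumptions: a linear schedule, fixed variance \(\sigma_t^2 = \beta_t\), and a toy noise predictor that is exact only when \(x_0 \sim \mathcal{N}(0, 1)\). A real model would replace `toy_eps_model` with a trained network, and none of these names come from the `dmme` API.

```python
import math
import random

def make_schedule(timesteps, start=1e-4, end=0.02):
    """Assumed linear schedule; index 0 is a dummy so that betas[t] is beta_t."""
    betas = [0.0] + [start + (end - start) * i / (timesteps - 1) for i in range(timesteps)]
    alpha_bars = [1.0]
    for beta in betas[1:]:
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return betas, alpha_bars

def sampling_step(x_t, t, eps_hat, z, betas, alpha_bars):
    """One draw from p_theta(x_{t-1} | x_t) with sigma_t^2 = beta_t."""
    beta_t = betas[t]
    mean = (x_t - beta_t / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(1.0 - beta_t)
    return mean + math.sqrt(beta_t) * z

def toy_eps_model(x_t, t, alpha_bars):
    """Stand-in for a trained network: E[eps | x_t] when x_0 ~ N(0, 1)."""
    return math.sqrt(1.0 - alpha_bars[t]) * x_t

def generate(timesteps=1000, rng=None):
    """Run the full reverse chain from x_T ~ N(0, 1) down to x_0."""
    rng = rng or random.Random(0)
    betas, alpha_bars = make_schedule(timesteps)
    x = rng.gauss(0.0, 1.0)
    for t in range(timesteps, 0, -1):
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0   # no noise at the final step
        x = sampling_step(x, t, toy_eps_model(x, t, alpha_bars), z, betas, alpha_bars)
    return x

sample = generate()
```

With this toy predictor each step reduces to \(x_{t-1} = \sqrt{\alpha_t}\,x_t + \sqrt{\beta_t}\,z\), so the chain keeps unit variance throughout, which makes it a convenient smoke test for the update rule.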

Substituting the two means into \(L_{t-1}\) simplifies the loss to

\[ \begin{aligned} \E_{\bx_0, \bepsilon}\bigg[ \underbrace{\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)}}_{\lambda_t} \left\| \bepsilon - \bepsilon_\theta(\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon, t) \right\|^2 \bigg] \end{aligned} \]
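The weight \(\lambda_t\) in this expression can be evaluated for every timestep. The sketch below does so in plain Python, assuming a linear schedule with \(T = 1000\) and \(\sigma_t^2 = \beta_t\); `loss_weight` is a hypothetical helper used only for illustration.

```python
def loss_weight(t, betas, alpha_bars, sigma2=None):
    """lambda_t = beta_t^2 / (2 sigma_t^2 alpha_t (1 - bar alpha_t)).

    sigma_t^2 defaults to beta_t, one of the two fixed choices in the text.
    """
    beta_t = betas[t]
    sigma2 = beta_t if sigma2 is None else sigma2
    return beta_t ** 2 / (2.0 * sigma2 * (1.0 - beta_t) * (1.0 - alpha_bars[t]))

# Assumed linear schedule, padded so that index t corresponds to step t.
T = 1000
betas = [0.0] + [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = [1.0]
for beta in betas[1:]:
    alpha_bars.append(alpha_bars[-1] * (1.0 - beta))

weights = [loss_weight(t, betas, alpha_bars) for t in range(1, T + 1)]
first_weight, last_weight = weights[0], weights[-1]
```

Under these assumptions the weight at \(t = 1\) equals \(1/(2\alpha_1)\), since \(1-\bar\alpha_1 = \beta_1\), and it is far larger than the weight at \(t = T\).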

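For small \(t\), the weight \(\lambda_t\) is comparatively large, so the weighted objective emphasizes low-noise steps. In the paper, setting \(\lambda_t = 1\) improves sample quality and yields the simplified objective: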

\[ \begin{aligned} L_\mathrm{simple} &\defeq \E_{t \sim \mathcal{U}(1, T), \bx_0, \bepsilon}\big[ \| \bepsilon - \bepsilon_\theta(\underbrace{\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon}_{\bx_t}, t) \|^2 \big] \\ \end{aligned} \]
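A single Monte Carlo sample of \(L_\mathrm{simple}\) can be sketched as follows in plain Python, with images represented as flat lists. The schedule, the placeholder `zero_model`, and the function names are assumptions for illustration. The sketch sums squared errors to match the squared norm above, whereas an MSE implementation would differ by a constant factor.

```python
import math
import random

def simple_loss_sample(eps_model, x_0, t, eps, alpha_bars):
    """Value of || eps - eps_theta(x_t, t) ||^2 for one (x_0, t, eps) triple."""
    abar = alpha_bars[t]
    x_t = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x_0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat))

def training_sample(eps_model, x_0, alpha_bars, rng):
    """Draw t uniformly from {1, ..., T} and eps ~ N(0, I), then evaluate the loss once."""
    timesteps = len(alpha_bars) - 1
    t = rng.randint(1, timesteps)
    eps = [rng.gauss(0.0, 1.0) for _ in x_0]
    return simple_loss_sample(eps_model, x_0, t, eps, alpha_bars)

def zero_model(x_t, t):
    """Placeholder predictor that always outputs zeros; not a trained network."""
    return [0.0] * len(x_t)

# Assumed linear schedule with T = 1000; alpha_bars[0] = 1 is padding.
T = 1000
alpha_bars = [1.0]
for i in range(T):
    alpha_bars.append(alpha_bars[-1] * (1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))))

loss = training_sample(zero_model, [0.5, -0.2, 0.1], alpha_bars, random.Random(0))
```

In practice the expectation is approximated by averaging this quantity over a minibatch, with an independent \(t\) and \(\bepsilon\) drawn for each image.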

`forward_process`	Forward Process, \(q(x_t|x_{t-1})\)
`reverse_process`	Reverse Denoising Process, \(p_\theta(x_{t-1}|x_t)\)
`sample_gaussian`	Samples from a gaussian distribution using the reparameterization trick
`linear_schedule`	constants increasing linearly from \(10^{-4}\) to \(0.02\)
`simple_loss`	Simple Loss objective \(L_\text{simple}\), MSE loss between noise and predicted noise
`DDPM`	Training and Sampling for DDPM
`UNet`	U-Net for predicting noise in images
`LitDDPM`	LightningModule for training DDPM

DDPM Training and Sampling#

class dmme.ddpm.DDPM(model, timesteps)[source]#

Training and Sampling for DDPM

Parameters:

model (nn.Module) – model for estimating noise
timesteps (int) – total timesteps \(T\)

training_step(x_0)[source]#

Computes loss for DDPM

Parameters:: x_0 (torch.Tensor) – sample image to add noise and denoise for training
Returns:: loss, \(L_\text{simple}\)
Return type:: (torch.Tensor)

sampling_step(x_t, t)[source]#

Denoise image by sampling from \(p_\theta(x_{t-1}|x_t)\)

Parameters:

model (nn.Module) – model for estimating noise
x_t (torch.Tensor) – image of shape \((N, C, H, W)\)
t (torch.Tensor) – starting \(t\) to sample from, a tensor of shape \((N,)\)

Returns:

denoised image of shape \((N, C, H, W)\)

Return type:

(torch.Tensor)

generate(img_size: Tuple[int, int, int, int])[source]#

Generate image of shape \((N, C, H, W)\) by running the full denoising steps

Parameters:: img_size (Tuple[int, int, int, int]) – image size to generate as a tuple \((N, C, H, W)\)
Returns:: generated image of shape \((N, C, H, W)\)
Return type:: (torch.Tensor)

forward(x, t)[source]#: Predicts noise given image and timestep

dmme.ddpm.forward_process(image, alpha_bar_t, noise)[source]#

Forward Process, \(q(x_t|x_{t-1})\)

Parameters:

image (torch.Tensor) – image of shape \((N, C, H, W)\)
alpha_bar_t (torch.Tensor) – \(\bar\alpha_t\) of shape \((N, 1, 1, *)\)
noise (torch.Tensor) – noise sampled from standard normal distribution with the same shape as the image

dmme.ddpm.reverse_process(x_t, beta_t, alpha_t, alpha_bar_t, noise_in_x_t, variance, noise)[source]#

Reverse Denoising Process, \(p_\theta(x_{t-1}|x_t)\)

Parameters:

beta_t (torch.Tensor) – \(\beta_t\) of shape \((N, 1, 1, *)\)
alpha_t (torch.Tensor) – \(\alpha_t\) of shape \((N, 1, 1, *)\)
alpha_bar_t (torch.Tensor) – \(\bar\alpha_t\) of shape \((N, 1, 1, *)\)
noise_in_x_t (torch.Tensor) – estimated noise in \(x_t\) predicted by a neural network
variance (torch.Tensor) – variance of the reverse process, either learned or fixed
noise (torch.Tensor) – noise sampled from \(\mathcal{N}(0, I)\)

dmme.ddpm.sample_gaussian(mean, variance, noise)[source]#

Samples from a gaussian distribution using the reparameterization trick

Parameters:

mean (torch.Tensor) – mean of the distribution
variance (torch.Tensor) – variance of the distribution
noise (torch.Tensor) – noise sampled from \(\mathcal{N}(0, I)\)
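The reparameterization trick expresses a Gaussian sample as a deterministic function of the mean, the variance, and standard normal noise. Below is a minimal sketch on plain Python lists. It is not the library's implementation, and `sample_gaussian_sketch` is a hypothetical name.

```python
import math

def sample_gaussian_sketch(mean, variance, noise):
    """Elementwise mean + sqrt(variance) * noise for equal-length lists."""
    return [m + math.sqrt(v) * n for m, v, n in zip(mean, variance, noise)]

sample = sample_gaussian_sketch([0.0, 1.0], [4.0, 0.25], [1.0, -2.0])
# sample == [2.0, 0.0]
```

Because the randomness enters only through `noise`, gradients can flow through `mean` and `variance` when the same expression is written with tensors.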

dmme.ddpm.linear_schedule(timesteps: int, start=0.0001, end=0.02) → Tensor[source]#

constants increasing linearly from \(10^{-4}\) to \(0.02\)

Parameters:

timesteps (int) – total timesteps
start (float) – starting value, defaults to 0.0001
end (float) – end value, defaults to 0.02
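The schedule can be pictured as evenly spaced values between `start` and `end`. The sketch below is a plain-Python stand-in that returns a list rather than a tensor and assumes both endpoints are included; `linear_schedule_sketch` is a hypothetical name, not the library function.

```python
def linear_schedule_sketch(timesteps, start=0.0001, end=0.02):
    """Return beta_1..beta_T spaced evenly from start to end (inclusive)."""
    if timesteps < 1:
        raise ValueError("timesteps must be positive")
    if timesteps == 1:
        return [start]
    step = (end - start) / (timesteps - 1)
    return [start + step * i for i in range(timesteps)]

betas = linear_schedule_sketch(1000)
```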

dmme.ddpm.simple_loss(noise, estimated_noise)[source]#

Simple Loss objective \(L_\text{simple}\), MSE loss between noise and predicted noise

Parameters:

noise (torch.Tensor) – noise used in the forward process
estimated_noise (torch.Tensor) – estimated noise with the same shape as noise

U-Net for estimating noise in images#

`UNet`	U-Net for predicting noise in images
`SinusoidalPositionEmbeddings`	Transformer position encoding
`ResBlock`	3x3 basic resblocks with group norm, dropout and timestep embeddings
`DownSample`	Downsample blocks
`UpSample`	Upsample blocks
`Attention`	Self Attention with groupnorm

class dmme.ddpm.UNet(in_channels, pos_dim=128, emb_dim=512, num_groups=32, dropout=0.1, channels_per_depth=(128, 256, 256, 256), num_blocks=2, attention_depths=(2,))[source]#

U-Net for predicting noise in images

Parameters:

in_channels (int) – input channels of image
pos_dim (int) – dimension of position embedding
emb_dim (int) – dimension of timestep embedding
num_groups (int) – number of groups in nn.GroupNorm
dropout (float) – dropout rate in nn.Dropout2d
channels_per_depth (Tuple[int, ...]) – channels per depth
num_blocks (int) – number of resblocks to use in each depth
attention_depths (Tuple[int, ...]) – depths to use attention blocks

forward(x, c)[source]#

Predicts noise from x

Parameters:

x (torch.Tensor) – image of shape \((N, C, H, W)\)
c (torch.Tensor) – timestep of shape \((N,)\)

Returns:

estimated noise in input image x

Return type:

(torch.Tensor)

class dmme.ddpm.SinusoidalPositionEmbeddings(dim)[source]#

Transformer position encoding

Parameters:: dim (int) – number of dimensions of the position embedding, \(d_\text{emb}\)

forward(t)[source]#

Parameters:: t (torch.Tensor) – timestep of shape \((N,)\)
Returns:: Positional Embedding of shape \((N, d_\text{emb})\)
Return type:: (torch.Tensor)
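A Transformer-style sinusoidal encoding maps a scalar timestep to sines and cosines at geometrically spaced frequencies. The sketch below works on a single timestep in plain Python. The layout (sines first, then cosines), the base of 10000, and the function name are assumptions and may differ from the class documented here.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Embed a scalar timestep t into a list of length dim (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies decay geometrically from 1 towards 1 / max_period.
    freqs = [max_period ** (-i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = sinusoidal_embedding(t=10, dim=8)
```

Applying the function to every entry of a batch of timesteps of shape \((N,)\) would give an array of shape \((N, d_\text{emb})\).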

class dmme.ddpm.ResBlock(c_in, c_out, with_attention=False, emb_dim=512, num_groups=32, p=0.1)[source]#

3x3 basic resblocks with group norm, dropout and timestep embeddings

Parameters:

c_in (int) – number of input channels
c_out (int) – number of output channels
with_attention (bool) – whether to add attention block
emb_dim (int) – input timestep embedding dimension
num_groups (int) – number of groups in nn.GroupNorm
p (float) – dropout rate in nn.Dropout2d

forward(x, c)[source]#

Parameters:

x (torch.Tensor) – image of shape \((N, C_\text{in}, H, W)\)
c (torch.Tensor) – timestep embedding of shape \((N, d_\text{emb})\)

Returns:

feature map of shape \((N, C_\text{out}, H, W)\)

Return type:

(torch.Tensor)

dmme.ddpm.DownSample(c_in, c_out)[source]#

Downsample blocks

Parameters:

c_in (int) – number of input channels
c_out (int) – number of output channels

Returns:

down sampling layer using 2d convolutions

Return type:

(nn.Conv2d)

class dmme.ddpm.UpSample(c_in, c_out)[source]#

Upsample blocks

Parameters:

c_in (int) – number of input channels
c_out (int) – number of output channels

forward(x)[source]#

Parameters:: x (torch.Tensor) – image of shape \((N, C_\text{in}, H, W)\)
Returns:: upsampled feature map of shape \((N, C_\text{out}, 2H, 2W)\)
Return type:: (torch.Tensor)

class dmme.ddpm.Attention(dim, num_groups)[source]#

Self Attention with groupnorm

Parameters:

dim (int) – equivalent to \(d_\text{model}\)
num_groups (int) – number of groups in nn.GroupNorm

forward(x)[source]#

Parameters:: x (torch.Tensor) – image of shape \((N, C_\text{in}, H, W)\)
Returns:: feature maps of shape \((N, C_\text{in}, H, W)\)
Return type:: (torch.Tensor)

Training Loop#

class dmme.ddpm.LitDDPM(model: Module, lr: float = 0.0002, warmup: int = 5000, imgsize: Tuple[int, int, int] = (3, 32, 32), timesteps: int = 1000, decay: float = 0.9999)[source]#

LightningModule for training DDPM

Parameters:

model (nn.Module) – neural network predicting noise \(\epsilon_\theta\)
lr (float) – learning rate, defaults to \(2 \times 10^{-4}\)
warmup (int) – linearly increases learning rate for warmup steps until lr is reached, defaults to 5000
imgsize (Tuple[int, int, int]) – image size in (C, H, W)
timesteps (int) – total timesteps for the forward and reverse process, \(T\)
decay (float) – EMA decay value

forward(x_t: Tensor, t: int)[source]#

Denoise image once using DDPM

Parameters:

x_t (torch.Tensor) – image of shape \((N, C, H, W)\)
t (int) – starting \(t\) to sample from
noise (torch.Tensor) – noise to use for sampling, if None samples new noise

Returns:

generated sample of shape \((N, C, H, W)\)

Return type:

(torch.Tensor)

training_step(batch, batch_idx)[source]#: Train model using \(L_\text{simple}\)

test_step(batch, batch_idx)[source]#: Generate samples for evaluation

generate(img_size)[source]#

Iteratively sample from \(p_\theta(x_{t-1}|x_t)\) to generate images

Parameters:: x_t (torch.Tensor) – \(x_T\) to start from

test_epoch_end(outputs)[source]#: Compute metrics and log at the end of the epoch

configure_optimizers()[source]#: Configure optimizers for training. Uses Adam with learning-rate warmup.
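Linear warmup, as described by the `warmup` parameter, can be sketched as a function of the optimizer step. This is an assumption about the shape of the schedule, starting from zero and reaching `lr` after `warmup` steps; it is not the module's actual scheduler code, and `warmup_lr` is a hypothetical name.

```python
def warmup_lr(step, base_lr=2e-4, warmup=5000):
    """Learning rate after `step` optimizer updates under linear warmup."""
    if warmup <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup)

halfway = warmup_lr(2500)
```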

configure_callbacks()[source]#: Configure EMA callback, will override any other EMA callback
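An exponential moving average (EMA) of the weights with a given `decay` can be sketched as below on a flat list of parameters. The update rule shown is the standard EMA form and is assumed here rather than taken from the callback's source; `ema_update` is a hypothetical name.

```python
def ema_update(ema_params, params, decay=0.9999):
    """Return the updated moving average of a flat list of parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Small decay chosen only to make the effect visible in a few steps.
ema = [0.0, 1.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 1.0], decay=0.5)
# ema == [0.875, 1.0]
```

With a decay close to 1, such as the default 0.9999, the averaged weights change slowly and smooth out noise from individual optimizer steps.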