DDPM#

In physics and chemistry, the principle of microscopic reversibility states that

“the microscopic detailed dynamics of particles and fields is time-reversible because the microscopic equations of motion are symmetric with respect to inversion in time”

This means that if a data distribution is diffused to noise, the reverse process exists at a microscopic level.

This is because the equations that describe the dynamics are “symmetric with respect to inversion in time”.

Assuming this reverse process exists, the Denoising Diffusion Probabilistic Model generates data by gradually denoising data starting from Gaussian noise.

Since this principle holds for “microscopic detailed dynamics”, we design a Forward Diffusion process that gradually diffuses data to Gaussian noise.

In each step, we sample from a Gaussian distribution that perturbs the data. Formally, we define it as a Markov chain of Gaussians:

\[ \begin{aligned} q(\bx_{1:T} | \bx_0) &\defeq \prod_{t=1}^T q(\bx_t | \bx_{t-1} ), \qquad q(\bx_t|\bx_{t-1}) \defeq \mathcal{N}(\bx_t;\sqrt{1-\beta_t}\bx_{t-1},\beta_t \bI) \end{aligned} \]

Note that we can sample $\bx_t$ for an arbitrary timestep $t$ in closed form:

\[ \begin{aligned} \alpha_t &\defeq 1-\beta_t, \quad \bar\alpha_t \defeq \prod_{s=1}^t \alpha_s \\ q(\bx_t|\bx_0) &= \mathcal{N}(\bx_t; \sqrt{\bar\alpha_t}\bx_0, (1-\bar\alpha_t)\bI) \end{aligned} \]
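The closed form can be checked numerically. The sketch below is plain Python and does not use the library; the helper `linear_betas` is a hypothetical stand-in for a linear $\beta_t$ schedule. It tracks the mean coefficient and variance of $\bx_t$ step by step through the Markov chain and compares them with $\sqrt{\bar\alpha_t}$ and $1-\bar\alpha_t$.

```python
import math

def linear_betas(timesteps, start=1e-4, end=0.02):
    """Hypothetical helper: [beta_1, ..., beta_T], evenly spaced from start to end."""
    step = (end - start) / (timesteps - 1)
    return [start + i * step for i in range(timesteps)]

T = 1000
betas = linear_betas(T)

# Closed form: alpha_bar_t is the running product of alpha_s = 1 - beta_s.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

# Step by step: x_t = mean_coef * x_0 + N(0, var * I) after each forward step.
mean_coef, var = 1.0, 0.0
for t, beta in enumerate(betas):
    mean_coef *= math.sqrt(1.0 - beta)  # the mean is scaled by sqrt(1 - beta_t)
    var = (1.0 - beta) * var + beta     # old variance is scaled, new noise is added
    assert math.isclose(mean_coef, math.sqrt(alpha_bar[t]), rel_tol=1e-9)
    assert math.isclose(var, 1.0 - alpha_bar[t], rel_tol=1e-9)

print(alpha_bar[0], alpha_bar[-1])
```

With this schedule $\bar\alpha_T$ is close to zero, so $q(\bx_T|\bx_0)$ is close to a standard normal regardless of $\bx_0$.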

If $\beta_t$ is small enough, the reverse process should also exist, and since the process is symmetric, it should also be a Markov chain of Gaussians, starting from $p(\bx_T)=\mathcal{N}(\bx_T; \bzero, \bI)$:

\[ \begin{aligned} p_\theta(\bx_{0:T}) &\defeq p(\bx_T)\prod_{t=1}^T p_\theta(\bx_{t-1}|\bx_t), \qquad p_\theta(\bx_{t-1}|\bx_t) \defeq \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t)) \end{aligned} \]

To generate data, we sample $\bx_T$ from the standard normal distribution and then iteratively sample from $p_\theta(\bx_{t-1}|\bx_t)$.

For training, we optimize the variational lower bound objective from variational autoencoders.

\[ \begin{aligned} \Ea{-\log p_\theta(\bx_0)} &\leq \Eb{q}{ - \log \frac{p_\theta(\bx_{0:T})}{q(\bx_{1:T} | \bx_0)}} \\ &= \mathbb{E}_q\bigg[ -\log p(\bx_T) - \sum_{t \geq 1} \log \frac{p_\theta(\bx_{t-1} | \bx_t)}{q(\bx_t|\bx_{t-1})} \bigg] \eqqcolon L \end{aligned} \]

We can rewrite the variational lower bound as

\[ \begin{aligned} \mathbb{E}_q \bigg[ \underbrace{\kl{q(\bx_T|\bx_0)}{p(\bx_T)}}_{L_T \, \approx \, 0} + \sum_{t > 1} \underbrace{\kl{q(\bx_{t-1}|\bx_t,\bx_0)}{p_\theta(\bx_{t-1}|\bx_t)}}_{L_{t-1}} \underbrace{-\log p_\theta(\bx_0|\bx_1)}_{L_0, \, \text{ignore}} \bigg] \end{aligned} \]
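For reference, the step from $L$ to this KL form can be sketched as follows. It relies only on the Markov property of the forward process and Bayes' rule, and follows the standard derivation rather than anything specific to this library:

\[ \begin{aligned} q(\bx_t|\bx_{t-1}) &= q(\bx_t|\bx_{t-1},\bx_0) = \frac{q(\bx_{t-1}|\bx_t,\bx_0)\, q(\bx_t|\bx_0)}{q(\bx_{t-1}|\bx_0)}, \qquad t > 1 \\ \sum_{t > 1} \log \frac{p_\theta(\bx_{t-1}|\bx_t)}{q(\bx_t|\bx_{t-1})} &= \sum_{t > 1} \log \frac{p_\theta(\bx_{t-1}|\bx_t)}{q(\bx_{t-1}|\bx_t,\bx_0)} + \sum_{t > 1} \log \frac{q(\bx_{t-1}|\bx_0)}{q(\bx_t|\bx_0)} \\ &= \sum_{t > 1} \log \frac{p_\theta(\bx_{t-1}|\bx_t)}{q(\bx_{t-1}|\bx_t,\bx_0)} + \log \frac{q(\bx_1|\bx_0)}{q(\bx_T|\bx_0)} \end{aligned} \]

The second sum telescopes. The remaining $\log q(\bx_1|\bx_0)$ cancels against the $t=1$ term of the original sum, and $-\log \frac{p(\bx_T)}{q(\bx_T|\bx_0)}$ becomes the KL term $L_T$ under the expectation.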

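The loss can therefore be written as $L = L_T + \sum_{t \gt 1}L_{t-1} + L_0$. The forward process posterior that appears in $L_{t-1}$ is tractable when conditioned on $\bx_0$: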

\[ \begin{aligned} q(\bx_{t-1}|\bx_t,\bx_0) &= \mathcal{N}(\bx_{t-1}; \tilde\bmu_t(\bx_t, \bx_0), \tilde\beta_t \bI), \\ \text{where}\quad \tilde\bmu_t(\bx_t, \bx_0) &\defeq \frac{\sqrt{\bar\alpha_{t-1}}\beta_t }{1-\bar\alpha_t}\bx_0 + \frac{\sqrt{\alpha_t}(1- \bar\alpha_{t-1})}{1-\bar\alpha_t} \bx_t \quad \text{and} \quad \tilde\beta_t \defeq \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t \end{aligned} \]

We parameterize the neural network so that $p_\theta(\bx_{t-1}|\bx_t)$ closely matches the forward process posterior in $L_{t-1}$.

Recall that $p_\theta(\bx_{t-1}|\bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))$ for ${1 \lt t \leq T}$.

The variance is fixed to $\sigma_t^2\bI$ rather than learned. Experimentally, both $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$ had similar results.

With $p_\theta(\bx_{t-1} | \bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \sigma_t^2\bI)$, we can write:

\[ \begin{aligned} L_{t-1} &= \mathbb{E}_q \bigg[ \frac{1}{2\sigma_t^2} \|\tilde\mu_t(x_t,x_0) - \mu_\theta(x_t, t)\|^2 \bigg] + C \\ \tilde\mu_t(x_t,x_0) &= \frac{1}{\sqrt{1-\beta_t}}\bigg(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon\bigg) \\ \mu_\theta(x_t,t) &= \frac{1}{\sqrt{1-\beta_t}}\bigg(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\bigg) \end{aligned} \]

Here $\epsilon$ is the noise that produced $x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon$, and $\epsilon_\theta(x_t,t)$ is the network's prediction of that noise.
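The two ways of writing the posterior mean, in terms of $(x_0, x_t)$ and in terms of $(x_t, \epsilon)$, should agree whenever $x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon$. The following is a minimal numerical check on scalars in plain Python. The function names are illustrative only, and the schedule is an assumed linear one padded so that index $t$ matches timestep $t$.

```python
import math
import random

T = 1000
# Assumed linear beta schedule, padded so that betas[t] is beta_t for t = 1..T.
betas = [0.0] + [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bar = [1.0]  # alpha_bar_0 = 1 (empty product)
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))

def posterior_mean_from_x0(x_t, x_0, t):
    """Posterior mean written in terms of x_0 and x_t."""
    coef_x0 = math.sqrt(alpha_bar[t - 1]) * betas[t] / (1.0 - alpha_bar[t])
    coef_xt = math.sqrt(1.0 - betas[t]) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    return coef_x0 * x_0 + coef_xt * x_t

def posterior_mean_from_eps(x_t, eps, t):
    """Posterior mean written in terms of x_t and the noise eps."""
    scaled_eps = betas[t] / math.sqrt(1.0 - alpha_bar[t]) * eps
    return (x_t - scaled_eps) / math.sqrt(1.0 - betas[t])

rng = random.Random(0)
max_gap = 0.0
for t in (2, 10, 500, 1000):
    x_0 = rng.uniform(-1.0, 1.0)
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar[t]) * x_0 + math.sqrt(1.0 - alpha_bar[t]) * eps
    gap = abs(posterior_mean_from_x0(x_t, x_0, t) - posterior_mean_from_eps(x_t, eps, t))
    max_gap = max(max_gap, gap)

print(max_gap)  # difference between the two forms, up to floating-point error
```

This is why predicting $\epsilon$ is enough: once $\epsilon_\theta$ replaces $\epsilon$, the second form gives $\mu_\theta$ directly.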

Input image data is assumed to consist of integers in $\{0, 1, \ldots, 255\}$ scaled linearly to $[-1, 1]$. The last step of the reverse process is set to an independent discrete decoder. At the final step of sampling, noise is not used.
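The linear scaling can be sketched as below. Both function names are hypothetical, and the inverse map (clamp, then round) is an assumption made for illustration, not a description of the discrete decoder.

```python
def to_model_range(pixel):
    """Map an integer pixel in {0, ..., 255} linearly to [-1, 1]."""
    return pixel / 127.5 - 1.0

def to_pixel(value):
    """Assumed inverse: clamp to [-1, 1], then round back to {0, ..., 255}."""
    clamped = max(-1.0, min(1.0, value))
    return int(round((clamped + 1.0) * 127.5))

print(to_model_range(0), to_model_range(255))
round_trip_ok = all(to_pixel(to_model_range(p)) == p for p in range(256))
print(round_trip_ok)
```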

Then we can simplify the loss to

\[ \begin{aligned} \E_{\bx_0, \bepsilon}\bigg[ \underbrace{\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)}}_{\lambda_t} \left\| \bepsilon - \bepsilon_\theta(\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon, t) \right\|^2 \bigg] \end{aligned} \]
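To get a feel for the weight $\lambda_t$ defined above, the sketch below evaluates it at a few timesteps. It assumes a linear $\beta_t$ schedule from $10^{-4}$ to $0.02$ with $T=1000$ and the choice $\sigma_t^2 = \beta_t$; `loss_weight` is an illustrative name.

```python
T = 1000
# Assumed linear beta schedule, padded so that betas[t] is beta_t for t = 1..T.
betas = [0.0] + [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bar = [1.0]
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))

def loss_weight(t):
    """lambda_t = beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)), with sigma_t^2 = beta_t."""
    sigma_sq = betas[t]
    alpha_t = 1.0 - betas[t]
    return betas[t] ** 2 / (2.0 * sigma_sq * alpha_t * (1.0 - alpha_bar[t]))

for t in (1, 10, 1000):
    print(t, loss_weight(t))
```

Under these assumptions the weight at $t=1$ comes out well above the weight at $t=T$.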

For small $t$, $\lambda_t$ is large, so the low-noise terms are weighted heavily. In the paper, setting $\lambda_t = 1$ improves sample quality:

\[ \begin{aligned} L_\mathrm{simple} &\defeq \E_{t \sim \mathcal{U}(1, T), \bx_0, \bepsilon}\big[ \| \bepsilon - \bepsilon_\theta(\underbrace{\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon}_{\bx_t}, t) \|^2 \big] \\ \end{aligned} \]
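A single Monte Carlo sample of $L_\mathrm{simple}$ might look like the sketch below. It is written in plain Python on a flattened example rather than with tensors, averages the squared error over elements as an MSE loss would, and uses a hypothetical `zero_model` in place of a trained $\bepsilon_\theta$.

```python
import math
import random

T = 1000
# Assumed linear beta schedule, padded so that betas[t] is beta_t for t = 1..T.
betas = [0.0] + [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bar = [1.0]
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))

def simple_loss(eps_model, x_0, rng):
    """One Monte Carlo sample of L_simple for a single flattened example."""
    t = rng.randint(1, T)                     # t ~ U{1, ..., T}, both ends included
    eps = [rng.gauss(0.0, 1.0) for _ in x_0]  # eps ~ N(0, I)
    scale_x0 = math.sqrt(alpha_bar[t])
    scale_eps = math.sqrt(1.0 - alpha_bar[t])
    x_t = [scale_x0 * x + scale_eps * e for x, e in zip(x_0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat)) / len(x_0)

def zero_model(x_t, t):
    """Hypothetical stand-in for eps_theta: always predicts zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
x_0 = [rng.uniform(-1.0, 1.0) for _ in range(64)]
losses = [simple_loss(zero_model, x_0, rng) for _ in range(200)]
mean_loss = sum(losses) / len(losses)
print(mean_loss)
```

A model that always predicts zero scores close to 1 here, because the target $\bepsilon$ has unit variance. That gives a rough baseline for reading training curves.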

`DDPM`	Forward, Reverse, Sampling for DDPM
`linear_schedule`	constants increasing linearly from $10^{-4}$ to $0.02$
`UNet`	UNet with GroupNorm and Attention, Predicts noise from $x_t$ and $t$
`LitDDPM`	LightningModule for training DDPM

Sampler#

class dmme.ddpm.DDPM(timesteps)[source]#

Forward, Reverse, Sampling for DDPM

Parameters:: timesteps (int) – total timesteps $T$

forward_process(x_0: Tensor, t: Tensor, noise: Tensor)[source]#

Forward Diffusion Process

Samples $x_t$ from $q(x_t|x_0) = \mathcal{N}(x_t;\sqrt{\bar\alpha_t}\bold{x}_0,(1-\bar\alpha_t)\bold{I})$

Computes $\bold{x}_t = \sqrt{\bar\alpha_t}\bold{x}_0 + \sqrt{1-\bar\alpha_t}\epsilon$, where $\epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$

Parameters:

x_0 (torch.Tensor) – data to add noise to
t (torch.Tensor) – timestep $t$ in $x_t$
noise (torch.Tensor, optional) – $\epsilon$, noise used in the forward process

Returns:

$\bold{x}_t \sim q(\bold{x}_t|\bold{x}_0)$

Return type:

(torch.Tensor)

reverse_process(model, x_t, t, noise)[source]#

Reverse Denoising Process

Samples $x_{t-1}$ from $p_\theta(\bold{x}_{t-1}|\bold{x}_t) = \mathcal{N}(\bold{x}_{t-1};\mu_\theta(\bold{x}_t, t), \sigma_t^2\bold{I})$

\[\begin{aligned} \bold\mu_\theta(\bold{x}_t, t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) \\ \sigma_t^2 &= \beta_t \end{aligned} \]

Computes $\bold{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) +\sigma_t\epsilon$

Parameters:

model (nn.Module) – model for estimating noise
x_t (torch.Tensor) – $x_t$, the noisy sample at timestep $t$
t (int) – current timestep
noise (torch.Tensor) – $\epsilon$, noise added in the reverse step
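The reverse update can be sketched in plain Python as follows. This is not the library's implementation: it works on flat lists instead of tensors, assumes $\sigma_t^2 = \beta_t$ (so $\sigma_t = \sqrt{\beta_t}$), uses a short chain with an assumed linear schedule, and substitutes a hypothetical `zero_model` for the trained network.

```python
import math
import random

T = 50  # short chain so the demo runs quickly
# Assumed linear beta schedule, padded so that betas[t] is beta_t for t = 1..T.
betas = [0.0] + [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bar = [1.0]
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))

def reverse_step(eps_model, x_t, t, noise):
    """Draw x_{t-1} = mu_theta(x_t, t) + sigma_t * noise, with sigma_t^2 = beta_t."""
    eps_hat = eps_model(x_t, t)
    coef = betas[t] / math.sqrt(1.0 - alpha_bar[t])
    inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - betas[t])
    sigma = math.sqrt(betas[t])
    return [inv_sqrt_alpha * (x - coef * e) + sigma * n
            for x, e, n in zip(x_t, eps_hat, noise)]

def generate(eps_model, dim, rng):
    """Start from x_T ~ N(0, I) and apply reverse_step for t = T, ..., 1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(T, 0, -1):
        # No noise is added at the final step (t = 1).
        noise = [rng.gauss(0.0, 1.0) if t > 1 else 0.0 for _ in range(dim)]
        x = reverse_step(eps_model, x, t, noise)
    return x

def zero_model(x_t, t):
    """Hypothetical stand-in for the trained noise predictor."""
    return [0.0] * len(x_t)

sample = generate(zero_model, dim=8, rng=random.Random(0))
print(len(sample))
```

Passing the noise in as an argument, as `reverse_process` does, keeps the step deterministic given its inputs, which makes it straightforward to test.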

sample(model, x_t, t, noise)[source]#

Sample from $p_\theta(x_{t-1}|x_t)$

Parameters:

model (nn.Module) – model for estimating noise
x_t (torch.Tensor) – image of shape $(N, C, H, W)$
t (int) – starting $t$ to sample from
noise (torch.Tensor) – noise to use for sampling, if None samples new noise

Returns:

generated sample of shape $(N, C, H, W)$

Return type:

(torch.Tensor)

dmme.ddpm.linear_schedule(timesteps: int, start=0.0001, end=0.02) → Tensor[source]#

constants increasing linearly from $10^{-4}$ to $0.02$

Parameters:

timesteps (int) – total timesteps
start (float) – starting value, defaults to 0.0001
end (float) – end value, defaults to 0.02

dmme.ddpm.pad(x: Tensor, value: float = 0) → Tensor[source]#: pads the tensor with `value` (0 by default) so that timestep $t$ matches the tensor index
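As an illustration of why the padding helps, here is a list-based sketch. The names `linear_schedule_sketch` and `pad_front` are hypothetical, and prepending a single entry is an assumption about how the index alignment is achieved, not a statement about the library's code.

```python
def linear_schedule_sketch(timesteps, start=1e-4, end=0.02):
    """Return `timesteps` values spaced evenly from `start` to `end` inclusive."""
    if timesteps == 1:
        return [start]
    step = (end - start) / (timesteps - 1)
    return [start + i * step for i in range(timesteps)]

def pad_front(values, value=0.0):
    """Prepend one entry so that values[t] corresponds to timestep t (1-based)."""
    return [value] + list(values)

betas = pad_front(linear_schedule_sketch(1000))
print(betas[1], betas[1000], len(betas))
```

With the extra leading entry, `betas[t]` reads as $\beta_t$ for $t = 1, \ldots, T$ without any off-by-one adjustment.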

Model#

class dmme.ddpm.UNet(in_channels=3, pos_dim=128, emb_dim=512, num_blocks=2, channels=(128, 256, 256, 256), attn_depth=(2,), groups=32, drop_rate=0.1)[source]#

UNet with GroupNorm and Attention, Predicts noise from $x_t$ and $t$

Parameters:

in_channels (int) – input image channels
pos_dim (int) – sinusoidal position encoding dim
emb_dim (int) – time embedding mlp dim
num_blocks (int) – number of resblocks to use
channels (Tuple[int...]) – list of channel dimensions
attn_depth (Tuple[int...]) – depth where attention is applied
groups (int) – number of groups in nn.GroupNorm
drop_rate (float) – drop_rate in ResBlock

forward(x, t)[source]#

Predicts the noise in $x_t$ from $x_t$ and $t$, using timestep embeddings to condition the UNet

Parameters:

x (torch.Tensor) – $x_t$, tensor of shape $(N, C, H, W)$
t (torch.Tensor) – $t$, tensor of shape $(N,)$

Returns:

$\epsilon_\theta(x_t,t)$ predicted noise from image, a tensor of shape $(N, C, H, W)$

Return type:

(torch.Tensor)

`SinusoidalPositionEmbeddings`(dim)	Transformer Sinusoidal Position Encoding
`Attention`(dim)	Self Attention layer
`PreNorm`(norm_layer, attention_layer)	Pre Normalization with residual connections
`ResBlock`(in_channels, out_channels, emb_dim, ...)	BasicWideResBlock for UNet with GroupNorm and optional self-attention
`conv3x3`(in_channels, out_channels, groups, ...)	Build 3x3 convolution with normalization and dropout in norm act drop conv order

class dmme.ddpm.SinusoidalPositionEmbeddings(dim)[source]#

Transformer Sinusoidal Position Encoding

Parameters:: dim (int) – embedding dimension

forward(t)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
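For readers unfamiliar with the encoding, a scalar version can be sketched as below. The layout (all sines followed by all cosines) and the base of 10000 follow the common Transformer convention and are assumptions here; the class may arrange its outputs differently. The function name is illustrative and `dim` is assumed to be even.

```python
import math

def sinusoidal_embedding(t, dim):
    """Transformer-style encoding of a scalar timestep t into `dim` features."""
    half = dim // 2
    # Frequencies fall geometrically from 1 toward 1/10000.
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    angles = [t * f for f in freqs]
    # Assumed layout: all sines first, then all cosines.
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

emb = sinusoidal_embedding(t=10, dim=128)
print(len(emb))
```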

class dmme.ddpm.Attention(dim)[source]#

Self Attention layer

Parameters:: dim (int) – $d_\text{model}$

forward(x)[source]#

Multi Head Self Attention on images with prenorm and residual connections

Returns:: x

class dmme.ddpm.PreNorm(norm_layer, attention_layer)[source]#

Pre Normalization with residual connections

Parameters:

norm_layer (nn.Module) – normalization layer
attention_layer (nn.Module) – attention layer

forward(x)[source]#

class dmme.ddpm.ResBlock(in_channels, out_channels, emb_dim, groups, drop_rate, attention=False)[source]#

BasicWideResBlock for UNet with GroupNorm and optional self-attention

Parameters:

in_channels (int) – number of input channels
out_channels (int) – number of output channels
emb_dim (int) – timestep embedding dim
groups (int) – num groups in nn.GroupNorm
drop_rate (float) – dropout applied in each conv
attention (bool) – flag for adding self-attention layer

forward(x, t)[source]#

dmme.ddpm.conv3x3(in_channels, out_channels, groups, drop_rate)[source]#

Build 3x3 convolution with normalization and dropout in norm act drop conv order

Parameters:

in_channels (int) – passed to nn.Conv2d
out_channels (int) – passed to nn.Conv2d
groups (int) – passed to nn.GroupNorm
drop_rate (float) – passed to nn.Dropout2d

Training#

class dmme.ddpm.LitDDPM(model: Module, lr: float = 0.0002, warmup: int = 5000, imgsize: Tuple[int, int, int] = (3, 32, 32), timesteps: int = 1000, decay: float = 0.9999)[source]#

LightningModule for training DDPM

Parameters:

model (nn.Module) – neural network predicting noise $\epsilon_\theta$
lr (float) – learning rate, defaults to $2 \times 10^{-4}$
warmup (int) – linearly increases learning rate for warmup steps until lr is reached, defaults to 5000
imgsize (Tuple[int, int, int]) – image size in (C, H, W)
timesteps (int) – total timesteps for the forward and reverse process, $T$
decay (float) – EMA decay value
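The `warmup` and `decay` parameters can be read as in the sketch below. Both functions are hypothetical illustrations of a linear warmup and an exponential moving average of the weights; the exact formulas used by the module and its EMA callback are not shown in this page, so treat these as assumptions.

```python
def warmup_lr(step, base_lr=2e-4, warmup=5000):
    """Ramp the learning rate linearly from 0 to base_lr over `warmup` steps."""
    return base_lr * min(1.0, step / warmup)

def ema_update(ema_params, params, decay=0.9999):
    """Return decay * ema + (1 - decay) * param for each parameter."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

print(warmup_lr(0), warmup_lr(2500), warmup_lr(5000), warmup_lr(10000))
ema = ema_update([1.0, 2.0], [0.0, 2.0], decay=0.9)
print(ema)
```

A decay close to 1, such as the default 0.9999, means the averaged weights change slowly and smooth out step-to-step noise in the raw weights.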

forward(x_t: Tensor, t: int, noise: Optional[Tensor] = None)[source]#

Denoise image once using DDPM

Parameters:

x_t (torch.Tensor) – image of shape $(N, C, H, W)$
t (int) – starting $t$ to sample from
noise (torch.Tensor) – noise to use for sampling, if None samples new noise

Returns:

generated sample of shape $(N, C, H, W)$

Return type:

(torch.Tensor)

training_step(batch, batch_idx)[source]#: Train model using $L_\text{simple}$

test_step(batch, batch_idx)[source]#: Generate samples for evaluation

generate(x_t)[source]#

Iteratively sample from $p_\theta(x_{t-1}|x_t)$ to generate images

Parameters:: x_t (torch.Tensor) – $x_T$ to start from

test_epoch_end(outputs)[source]#: Compute metrics and log at the end of the epoch

configure_optimizers()[source]#: Configure optimizers for training. Uses Adam with learning-rate warmup.

configure_callbacks()[source]#: Configure EMA callback, will override any other EMA callback