DDPM

In physics and chemistry, the principle of microscopic reversibility states that

“the microscopic detailed dynamics of particles and fields is time-reversible because the microscopic equations of motion are symmetric with respect to inversion in time”

This means that if a data distribution is diffused to noise, the reverse process exists at the microscopic level.

This is because the equations that describe the dynamics are “symmetric with respect to inversion in time”.

Assuming this reverse process exists, the Denoising Diffusion Probabilistic Model generates data by gradually denoising data starting from Gaussian noise.

Since this principle holds for “microscopic detailed dynamics”, we design a Forward Diffusion process that gradually diffuses data to Gaussian noise.

In each step, we sample from a Gaussian distribution that perturbs the data. Formally, we define it as a Markov chain of Gaussians:

\[ \begin{aligned} q(\bx_{1:T} | \bx_0) &\defeq \prod_{t=1}^T q(\bx_t | \bx_{t-1} ), \qquad q(\bx_t|\bx_{t-1}) \defeq \mathcal{N}(\bx_t;\sqrt{1-\beta_t}\bx_{t-1},\beta_t \bI) \end{aligned} \]

Note that we can sample $\bx_t$ for an arbitrary timestep $t$ in closed form:

\[ \begin{aligned} \alpha_t &\defeq 1-\beta_t, \quad \bar\alpha_t \defeq \prod_{s=1}^t \alpha_s \\ q(\bx_t|\bx_0) &= \mathcal{N}(\bx_t; \sqrt{\bar\alpha_t}\bx_0, (1-\bar\alpha_t)\bI) \end{aligned} \]
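
The closed form can be checked numerically. The following is a minimal sketch in plain Python, not the `dmme` implementation: the helper names `linear_betas` and `iterate_forward_stats` are hypothetical, and the schedule is assumed to be evenly spaced from $10^{-4}$ to $0.02$ with $T = 1000$. It tracks the mean coefficient and variance of $\bx_t$ given $\bx_0$ one Markov step at a time and compares them with $\sqrt{\bar\alpha_t}$ and $1-\bar\alpha_t$.

```python
import math

def linear_betas(timesteps, start=1e-4, end=0.02):
    """Evenly spaced beta_1..beta_T (illustrative schedule, not the dmme code)."""
    step = (end - start) / (timesteps - 1)
    return [start + i * step for i in range(timesteps)]

T = 1000
betas = linear_betas(T)
alphas = [1.0 - b for b in betas]

# Cumulative products: alpha_bars[t - 1] holds \bar\alpha_t.
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

def iterate_forward_stats(t):
    """Mean coefficient and variance of x_t given x_0, one Markov step at a time."""
    coef, var = 1.0, 0.0
    for s in range(t):
        coef *= math.sqrt(alphas[s])      # the mean is scaled by sqrt(1 - beta_s)
        var = alphas[s] * var + betas[s]  # old variance shrinks, beta_s is added
    return coef, var

t = 500
coef, var = iterate_forward_stats(t)
closed_coef = math.sqrt(alpha_bars[t - 1])
closed_var = 1.0 - alpha_bars[t - 1]
print(coef, closed_coef)
print(var, closed_var)
```

Both pairs agree up to floating-point error, which is why training can jump straight to any $t$ without simulating the chain.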

If each $\beta_t$ is small enough, the reverse process should also exist and, since the process is symmetric, it should also be a Markov chain of Gaussians, starting from $p(\bx_T)=\mathcal{N}(\bx_T; \bzero, \bI)$:

\[ \begin{aligned} p_\theta(\bx_{0:T}) &\defeq p(\bx_T)\prod_{t=1}^T p_\theta(\bx_{t-1}|\bx_t), \qquad p_\theta(\bx_{t-1}|\bx_t) \defeq \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t)) \end{aligned} \]

To generate data, we sample $\bx_T$ from the standard normal distribution and then iteratively sample $\bx_{t-1} \sim p_\theta(\bx_{t-1}|\bx_t)$ for $t = T, \dots, 1$.
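
The generation procedure is ancestral sampling through the reverse chain. The sketch below shows the loop for a scalar toy problem. `ancestral_sample`, `toy_mean` and `toy_std` are hypothetical names, and the toy mean and standard deviation stand in for the learned $\bmu_\theta$ and the chosen $\sigma_t$. They are not a trained model.

```python
import random

def ancestral_sample(mean_fn, std_fn, timesteps, rng):
    """Draw x_T ~ N(0, 1), then x_{t-1} ~ N(mean_fn(x_t, t), std_fn(t)^2) for t = T..1."""
    x = rng.gauss(0.0, 1.0)
    trajectory = [x]
    for t in range(timesteps, 0, -1):
        x = mean_fn(x, t) + std_fn(t) * rng.gauss(0.0, 1.0)
        trajectory.append(x)
    return trajectory

# Toy stand-ins for the learned mean and the fixed standard deviation.
toy_mean = lambda x_t, t: 0.9 * x_t
toy_std = lambda t: 0.1

trajectory = ancestral_sample(toy_mean, toy_std, timesteps=10, rng=random.Random(0))
print(len(trajectory))  # x_T plus one entry per reverse step
```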

For training, we optimize the variational lower bound objective used in variational autoencoders:

\[ \begin{aligned} \Ea{-\log p_\theta(\bx_0)} &\leq \Eb{q}{ - \log \frac{p_\theta(\bx_{0:T})}{q(\bx_{1:T} | \bx_0)}} \\ &= \mathbb{E}_q\bigg[ -\log p(\bx_T) - \sum_{t \geq 1} \log \frac{p_\theta(\bx_{t-1} | \bx_t)}{q(\bx_t|\bx_{t-1})} \bigg] \eqqcolon L \end{aligned} \]

We can rewrite the variational lower bound as a sum of KL divergences:

\[ \begin{aligned} \mathbb{E}_q \bigg[ \underbrace{\kl{q(\bx_T|\bx_0)}{p(\bx_T)}}_{L_T \, \approx \, 0} + \sum_{t > 1} \underbrace{\kl{q(\bx_{t-1}|\bx_t,\bx_0)}{p_\theta(\bx_{t-1}|\bx_t)}}_{L_{t-1}} \underbrace{-\log p_\theta(\bx_0|\bx_1)}_{L_0, \, \text{ignore}} \bigg] \end{aligned} \]

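Rewriting the loss as $L = L_T + \sum_{t \gt 1}L_{t-1} + L_0$, each $L_{t-1}$ compares $p_\theta$ with the forward process posterior, which is tractable when conditioned on $\bx_0$: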

\[ \begin{aligned} q(\bx_{t-1}|\bx_t,\bx_0) &= \mathcal{N}(\bx_{t-1}; \tilde\bmu_t(\bx_t, \bx_0), \tilde\beta_t \bI), \\ \text{where}\quad \tilde\bmu_t(\bx_t, \bx_0) &\defeq \frac{\sqrt{\bar\alpha_{t-1}}\beta_t }{1-\bar\alpha_t}\bx_0 + \frac{\sqrt{\alpha_t}(1- \bar\alpha_{t-1})}{1-\bar\alpha_t} \bx_t \quad \text{and} \quad \tilde\beta_t \defeq \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t \end{aligned} \]
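
As a sanity check on the posterior formulas, the sketch below computes the two mean coefficients and $\tilde\beta_t$ in plain Python. The helper `posterior_coefficients`, the evenly spaced schedule, and the 1-based indexing with a placeholder at index 0 are assumptions made for illustration and are not taken from `dmme`.

```python
import math

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]

# Index 0 is a placeholder so that alpha_bar[t] is \bar\alpha_t and alpha_bar[0] = 1.
beta = [0.0] + betas
alpha = [1.0 - b for b in beta]
alpha_bar = [1.0]
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * alpha[t])

def posterior_coefficients(t):
    """Return (coefficient on x_0, coefficient on x_t, variance beta_tilde_t)."""
    denom = 1.0 - alpha_bar[t]
    coef_x0 = math.sqrt(alpha_bar[t - 1]) * beta[t] / denom
    coef_xt = math.sqrt(alpha[t]) * (1.0 - alpha_bar[t - 1]) / denom
    beta_tilde = (1.0 - alpha_bar[t - 1]) / denom * beta[t]
    return coef_x0, coef_xt, beta_tilde

print(posterior_coefficients(1))
print(posterior_coefficients(500))
```

At $t = 1$ the posterior collapses onto $\bx_0$ (coefficient on $\bx_t$ and variance are both zero), and for every $t$ the variance satisfies $\tilde\beta_t < \beta_t$.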

We parameterize the neural network so that $p_\theta(\bx_{t-1}|\bx_t)$ closely matches the forward process posterior in $L_{t-1}$.

Recall that $p_\theta(\bx_{t-1}|\bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))$ for ${1 \lt t \leq T}$.

We fix the variance to an untrained, time-dependent constant, $\bSigma_\theta(\bx_t, t) = \sigma_t^2\bI$. Experimentally, both $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$ had similar results.

With $p_\theta(\bx_{t-1} | \bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \sigma_t^2\bI)$ and $\bx_t = \sqrt{\bar\alpha_t}\bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon$, we can write:

\[ \begin{aligned} L_{t-1} &= \mathbb{E}_q \bigg[ \frac{1}{2\sigma_t^2} \|\tilde\bmu_t(\bx_t,\bx_0) - \bmu_\theta(\bx_t, t)\|^2 \bigg] + C \\ \tilde\bmu_t(\bx_t,\bx_0) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bx_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\bepsilon\bigg) \\ \bmu_\theta(\bx_t,t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bx_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\bepsilon_\theta(\bx_t,t)\bigg) \end{aligned} \]
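
The noise form of the posterior mean follows from substituting $\bx_0 = (\bx_t - \sqrt{1-\bar\alpha_t}\,\epsilon)/\sqrt{\bar\alpha_t}$ into its $\bx_0$ form. The sketch below checks that the two forms agree for one set of scalar values. All numbers are arbitrary illustrative choices, not values from the library.

```python
import math

# Illustrative scalar values; any 0 < beta_t < 1 and 0 < alpha_bar_prev <= 1 work.
beta_t = 0.01
alpha_t = 1.0 - beta_t
alpha_bar_prev = 0.6                    # \bar\alpha_{t-1}
alpha_bar_t = alpha_bar_prev * alpha_t  # \bar\alpha_t

x_0, eps = 0.3, -1.2
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

# Posterior mean written in terms of x_0 and x_t.
mean_from_x0 = (
    math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t) * x_0
    + math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * x_t
)

# The same mean written in terms of x_t and the noise eps.
mean_from_eps = (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)

print(mean_from_x0, mean_from_eps)
```

Because the two expressions coincide, predicting the noise is enough to recover the posterior mean.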

Input image data is assumed to be integers in $\{0, 1, \ldots, 255\}$ scaled linearly to $[-1, 1]$. The last step of the reverse process is set to an independent discrete decoder. At the final step of sampling, noise is not used.
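
The linear scaling of pixel values can be sketched as below. Both helper names are hypothetical, and the inverse mapping with clipping and rounding is an assumption added for illustration. It is not the discrete decoder described above.

```python
def scale_to_unit_range(pixel):
    """Map an integer pixel value in {0, ..., 255} linearly onto [-1, 1]."""
    return pixel / 127.5 - 1.0

def unscale_to_pixel(value):
    """Invert the scaling: clip to [-1, 1] and round to the nearest integer pixel."""
    clipped = max(-1.0, min(1.0, value))
    return int(round((clipped + 1.0) * 127.5))

print(scale_to_unit_range(0), scale_to_unit_range(255))
```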

Substituting the noise parameterization of both means into $L_{t-1}$, we can simplify the loss to

\[ \begin{aligned} \E_{\bx_0, \bepsilon}\bigg[ \underbrace{\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)}}_{\lambda_t} \left\| \bepsilon - \bepsilon_\theta(\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon, t) \right\|^2 \bigg] \end{aligned} \]

For small $t$, the weight $\lambda_t$ is comparatively large. In the paper, dropping the weight (setting $\lambda_t = 1$) improves sample quality:
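
To see how the weight behaves across timesteps, the sketch below evaluates $\lambda_t$ for the choice $\sigma_t^2 = \beta_t$. The function name `loss_weights` and the evenly spaced schedule from $10^{-4}$ to $0.02$ with $T = 1000$ are assumptions made for this illustration.

```python
T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]

def loss_weights(betas):
    """lambda_t = beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)) with sigma_t^2 = beta_t."""
    weights, alpha_bar = [], 1.0
    for b in betas:
        alpha = 1.0 - b
        alpha_bar *= alpha
        sigma_sq = b
        weights.append(b * b / (2.0 * sigma_sq * alpha * (1.0 - alpha_bar)))
    return weights

weights = loss_weights(betas)
print(weights[0], weights[-1])  # lambda_1 and lambda_T
```

Under these assumptions $\lambda_1$ is roughly fifty times $\lambda_T$, so replacing $\lambda_t$ by 1 shifts relative weight toward larger $t$.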

\[ \begin{aligned} L_\mathrm{simple} &\defeq \E_{t \sim \mathcal{U}(1, T), \bx_0, \bepsilon}\big[ \| \bepsilon - \bepsilon_\theta(\underbrace{\sqrt{\bar\alpha_t} \bx_0 + \sqrt{1-\bar\alpha_t}\bepsilon}_{\bx_t}, t) \|^2 \big] \\ \end{aligned} \]
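
A single Monte Carlo sample of $L_\mathrm{simple}$ can be sketched in plain Python as follows. The names `simple_loss` and `training_loss` are hypothetical, `eps_model` is any callable taking `(x_t, t)`, and data points are flat lists of floats rather than image tensors. The actual training step in `LitDDPM` may be organized differently.

```python
import math
import random

def simple_loss(x_0, eps, alpha_bar_t, t, eps_model):
    """Squared error between the true noise and the prediction made from x_t."""
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x_0, eps)]
    predicted = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, predicted))

def training_loss(x_0, alpha_bars, eps_model, rng):
    """One sample of L_simple: draw t and eps, then score the model."""
    t = rng.randint(1, len(alpha_bars))  # t ~ U{1, ..., T}, both ends inclusive
    eps = [rng.gauss(0.0, 1.0) for _ in x_0]
    return simple_loss(x_0, eps, alpha_bars[t - 1], t, eps_model)

# A placeholder "model" that always predicts zero noise.
zero_model = lambda x_t, t: [0.0] * len(x_t)
loss = training_loss([0.5, -0.25], [0.9, 0.5, 0.1], zero_model, random.Random(0))
print(loss)
```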

`DDPM`	Forward, Reverse, Sampling for DDPM
`linear_schedule`	constants increasing linearly from $10^{-4}$ to $0.02$
`UNet`
`LitDDPM`	LightningModule for training DDPM

Sampler

class dmme.ddpm.DDPM(timesteps)

Forward, Reverse, Sampling for DDPM

Parameters: timesteps (int) – total timesteps $T$

forward_process(x_0: Tensor, t: Tensor, noise: Tensor)

Forward Diffusion Process

Samples $x_t$ from $q(x_t|x_0) = \mathcal{N}(x_t;\sqrt{\bar\alpha_t}\bold{x}_0,(1-\bar\alpha_t)\bold{I})$

Computes $\bold{x}_t = \sqrt{\bar\alpha_t}\bold{x}_0 + \sqrt{1-\bar\alpha_t}\epsilon$

Parameters:

x_0 (torch.Tensor) – data to add noise to
t (torch.Tensor) – timestep $t$ in $x_t$
noise (torch.Tensor, optional) – $\epsilon$, noise used in the forward process

Returns:

$\bold{x}_t \sim q(\bold{x}_t|\bold{x}_0)$

Return type:

(torch.Tensor)

reverse_process(model, x_t, t, noise)

Reverse Denoising Process

Samples $\bold{x}_{t-1}$ from $p_\theta(\bold{x}_{t-1}|\bold{x}_t) = \mathcal{N}(\bold{x}_{t-1};\mu_\theta(\bold{x}_t, t), \sigma_t^2\bold{I})$

\[\begin{aligned} \bold\mu_\theta(\bold{x}_t, t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) \\ \sigma_t^2 &= \beta_t \end{aligned} \]

Computes $\bold{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) +\sigma_t\epsilon$

Parameters:

model (nn.Module) – model for estimating noise
x_t (torch.Tensor) – noisy sample $x_t$
t (int) – current timestep
noise (torch.Tensor) – noise
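
The reverse step can be sketched in plain Python as below. This is an illustration under stated assumptions, not the `dmme` code: `reverse_step` is a hypothetical standalone function, samples are flat lists of floats, `eps_model` is any callable taking `(x_t, t)`, and the schedule lists are 0-based so that `betas[t - 1]` holds $\beta_t$. It uses the variance choice $\sigma_t^2 = \beta_t$.

```python
import math

def reverse_step(eps_model, x_t, t, noise, betas, alpha_bars):
    """One reverse step x_t -> x_{t-1}; betas[t - 1] holds beta_t (1-based t)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    eps_hat = eps_model(x_t, t)
    mean = [(x - beta_t / math.sqrt(1.0 - alpha_bars[t - 1]) * e) / math.sqrt(alpha_t)
            for x, e in zip(x_t, eps_hat)]
    sigma_t = math.sqrt(beta_t)  # standard deviation for the choice sigma_t^2 = beta_t
    return [m + sigma_t * n for m, n in zip(mean, noise)]
```

With a perfect noise prediction and zero added noise, the step at $t = 1$ returns $\bold{x}_0$ exactly, which is a convenient check when wiring up a sampler.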

sample(model, x_t, t, noise)

Sample from $p_\theta(x_{t-1}|x_t)$

Parameters:

model (nn.Module) – model for estimating noise
x_t (torch.Tensor) – image of shape $(N, C, H, W)$
t (int) – starting $t$ to sample from
noise (torch.Tensor) – noise to use for sampling, if None samples new noise

Returns:

generated sample of shape $(N, C, H, W)$

Return type:

(torch.Tensor)

dmme.ddpm.linear_schedule(timesteps: int, start=0.0001, end=0.02) → Tensor

constants increasing linearly from $10^{-4}$ to $0.02$

Parameters:

timesteps (int) – total timesteps
start (float) – starting value, defaults to 0.0001
end (float) – end value, defaults to 0.02

dmme.ddpm.pad(x: Tensor, value: float = 0) → Tensor: pads the tensor with `value` (0 by default) so that timestep $t$ matches the tensor index
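
The idea behind these two helpers can be sketched in plain Python. The functions below are stand-ins written for illustration, not the `dmme` implementations: it is an assumption that the schedule includes both endpoints, and that padding prepends a single element so that index `t` lines up with timestep $t$.

```python
def linear_schedule(timesteps, start=1e-4, end=0.02):
    """Evenly spaced values from start to end inclusive (pure-Python stand-in)."""
    if timesteps == 1:
        return [start]
    step = (end - start) / (timesteps - 1)
    return [start + i * step for i in range(timesteps)]

def pad_front(values, value=0.0):
    """Prepend one element so that values[t] corresponds to timestep t (assumed layout)."""
    return [value] + list(values)

beta = pad_front(linear_schedule(1000))
print(len(beta), beta[1], beta[1000])
```

With this layout `beta[t]` can be read directly for $t = 1, \dots, T$ without subtracting one.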

Model

class dmme.ddpm.UNet(in_channels, pos_dim=128, emb_dim=512, num_groups=32, dropout=0.1, channels_per_depth=(128, 256, 256, 256), num_blocks=2, attention_depths=(2,))

forward(x, c)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

`SinusoidalPositionEmbeddings`(dim)
`Attention`(dim, num_groups)
`ResBlock`(c_in, c_out[, with_attention, ...])

class dmme.ddpm.SinusoidalPositionEmbeddings(dim)

forward(t)

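
Sinusoidal position embeddings map a timestep to a vector of sines and cosines at geometrically spaced frequencies. The sketch below follows the common Transformer-style construction. The function name, the base of 10000, and the layout (all sines followed by all cosines, even `dim`) are assumptions, and the class above may arrange or scale the terms differently.

```python
import math

def sinusoidal_embedding(t, dim):
    """Return [sin(t*w_0), ..., sin(t*w_{h-1}), cos(t*w_0), ..., cos(t*w_{h-1})], h = dim // 2."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]

embedding = sinusoidal_embedding(17, 128)
print(len(embedding))
```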

class dmme.ddpm.Attention(dim, num_groups)

forward(x)


class dmme.ddpm.ResBlock(c_in, c_out, with_attention=False, emb_dim=512, num_groups=32, p=0.1)

forward(x, c)


Training

class dmme.ddpm.LitDDPM(model: Module, lr: float = 0.0002, warmup: int = 5000, imgsize: Tuple[int, int, int] = (3, 32, 32), timesteps: int = 1000, decay: float = 0.9999)

LightningModule for training DDPM

Parameters:

model (nn.Module) – neural network predicting noise $\epsilon_\theta$
lr (float) – learning rate, defaults to $2 \times 10^{-4}$
warmup (int) – linearly increases learning rate for warmup steps until lr is reached, defaults to 5000
imgsize (Tuple[int, int, int]) – image size in (C, H, W)
timesteps (int) – total timesteps for the forward and reverse process, $T$
decay (float) – EMA decay value

forward(x_t: Tensor, t: int, noise: Optional[Tensor] = None)

Denoise image once using DDPM

Parameters:

x_t (torch.Tensor) – image of shape $(N, C, H, W)$
t (int) – starting $t$ to sample from
noise (torch.Tensor) – noise to use for sampling, if None samples new noise

Returns:

generated sample of shape $(N, C, H, W)$

Return type:

(torch.Tensor)

training_step(batch, batch_idx): Train model using $L_\text{simple}$

test_step(batch, batch_idx): Generate samples for evaluation

generate(x_t)

Iteratively sample from $p_\theta(x_{t-1}|x_t)$ to generate images

Parameters: x_t (torch.Tensor) – $x_T$ to start from

test_epoch_end(outputs): Compute metrics and log at the end of the epoch

configure_optimizers(): Configure optimizers for training. Uses Adam with learning-rate warmup.

configure_callbacks(): Configure EMA callback, will override any other EMA callback
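
The warmup and EMA behaviour described above can be sketched with two small functions. These are hypothetical plain-Python helpers operating on floats and lists, using the documented defaults (`lr=2e-4`, `warmup=5000`, `decay=0.9999`). The exact warmup shape and the EMA callback used by `LitDDPM` are not shown in this page, so treat the details as assumptions.

```python
def warmup_lr(step, base_lr=2e-4, warmup=5000):
    """Learning rate that rises linearly to base_lr over `warmup` steps, then stays flat."""
    return base_lr * min(1.0, step / warmup)

def ema_update(ema_params, params, decay=0.9999):
    """Exponential moving average: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

print(warmup_lr(2500), warmup_lr(10000))
print(ema_update([1.0, 2.0], [0.0, 0.0], decay=0.5))
```

A decay close to 1 means the averaged weights change slowly, which is why EMA weights are commonly used for evaluation and sampling rather than the raw training weights.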