DDPM
In physics and chemistry, the principle of microscopic reversibility states that
“the microscopic detailed dynamics of particles and fields is time-reversible because the microscopic equations of motion are symmetric with respect to inversion in time”.
This means that if a data distribution is diffused to noise, a reverse process exists at the microscopic level, because the equations that describe the dynamics are symmetric under time inversion.
Assuming this reverse process exists, the Denoising Diffusion Probabilistic Model generates data by gradually denoising data starting from Gaussian noise.
Since this principle holds for “microscopic detailed dynamics”, we design a Forward Diffusion process that gradually diffuses data to Gaussian noise.
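In each step, we sample from a Gaussian distribution that perturbs the data. Formally, we define it as a Markov chain of Gaussians:
\[q(\bx_{1:T}|\bx_0) = \prod_{t=1}^{T} q(\bx_t|\bx_{t-1}), \qquad q(\bx_t|\bx_{t-1}) = \mathcal{N}(\bx_t; \sqrt{1-\beta_t}\,\bx_{t-1}, \beta_t\bI)\]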
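Note that we can sample \(\bx_t\) for an arbitrary timestep \(t\) in closed form. With \(\alpha_t = 1-\beta_t\) and \(\bar\alpha_t = \prod_{s=1}^{t}\alpha_s\):
\[q(\bx_t|\bx_0) = \mathcal{N}(\bx_t; \sqrt{\bar\alpha_t}\,\bx_0, (1-\bar\alpha_t)\bI)\]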
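The closed form can be checked numerically. The sketch below is plain Python with an assumed linear schedule and hypothetical helper names (none of them are part of dmme). It pushes the mean coefficient and variance of \(\bx_t\) given \(\bx_0\) through \(t\) single forward steps and compares them with \(\sqrt{\bar\alpha_t}\) and \(1-\bar\alpha_t\).

```python
import math

# Assumed linear schedule with T = 1000 (a stand-in, not the library's own code).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]


def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def iterate_forward_moments(betas, t):
    """Apply t single steps q(x_s | x_{s-1}) to the moments of x_s given x_0.

    Each step scales the mean by sqrt(1 - beta_s) and maps the variance
    v to (1 - beta_s) * v + beta_s.
    """
    mean_coef, variance = 1.0, 0.0
    for beta in betas[:t]:
        mean_coef *= math.sqrt(1.0 - beta)
        variance = (1.0 - beta) * variance + beta
    return mean_coef, variance


alpha_bars = cumulative_alphas(betas)
t = 500
mean_coef, variance = iterate_forward_moments(betas, t)
closed_form = (math.sqrt(alpha_bars[t - 1]), 1.0 - alpha_bars[t - 1])
```

Both pairs agree up to rounding, which is why training can jump straight to any timestep instead of simulating the chain.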
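If \(\beta_t\) is small enough, the reverse process has the same functional form as the forward process, so it is also a Markov chain of Gaussians, starting from \(p(\bx_T)=\mathcal{N}(\bx_T; \bzero, \bI)\):
\[p_\theta(\bx_{0:T}) = p(\bx_T)\prod_{t=1}^{T} p_\theta(\bx_{t-1}|\bx_t), \qquad p_\theta(\bx_{t-1}|\bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))\]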
In order to generate data, we sample \(\bx_T\) from the standard normal distribution and then iteratively sample from \(p_\theta(\bx_{t-1}|\bx_t)\) for \(t = T, \ldots, 1\).
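As a rough illustration of this loop, the sketch below samples a single scalar in plain Python. It assumes a linear schedule, the choice \(\sigma_t^2 = \beta_t\), and a toy data distribution that is always equal to a constant `c`, for which the ideal noise prediction is known in closed form. The function name and the toy predictor are hypothetical stand-ins for the trained network.

```python
import math
import random


def sample_point_mass(c, timesteps=1000, seed=0):
    """Run the reverse chain for data that is always equal to c."""
    rng = random.Random(seed)
    betas = [1e-4 + (0.02 - 1e-4) * i / (timesteps - 1) for i in range(timesteps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    def eps_model(x_t, t):
        # Exact noise for point-mass data (toy replacement for the network).
        abar = alpha_bars[t - 1]
        return (x_t - math.sqrt(abar) * c) / math.sqrt(1.0 - abar)

    x = rng.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
    for t in range(timesteps, 0, -1):
        beta, abar = betas[t - 1], alpha_bars[t - 1]
        eps_scale = beta / math.sqrt(1.0 - abar)
        mean = (x - eps_scale * eps_model(x, t)) / math.sqrt(1.0 - beta)
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise at the final step
        x = mean + math.sqrt(beta) * z
    return x


sample = sample_point_mass(c=0.5)
```

With a perfect noise predictor the chain ends at `c`, whatever noise it started from.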
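For training, we optimize the variational lower bound objective used in variational autoencoders:
\[\mathbb{E}[-\log p_\theta(\bx_0)] \leq \mathbb{E}_q\left[-\log\frac{p_\theta(\bx_{0:T})}{q(\bx_{1:T}|\bx_0)}\right] = L\]
We can rewrite the variational lower bound as a sum of KL divergences:
\[L = \mathbb{E}_q\left[D_\text{KL}(q(\bx_T|\bx_0)\,\|\,p(\bx_T)) + \sum_{t\gt1} D_\text{KL}(q(\bx_{t-1}|\bx_t,\bx_0)\,\|\,p_\theta(\bx_{t-1}|\bx_t)) - \log p_\theta(\bx_0|\bx_1)\right]\]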
Rewriting the loss as \(L = L_T + \sum_{t\gt1}L_{t-1} + L_0\),
we parameterize the neural network so that \(p_\theta(\bx_{t-1}|\bx_t)\) closely matches the forward process posterior \(q(\bx_{t-1}|\bx_t, \bx_0)\) in \(L_{t-1}\).
Recall that \(p_\theta(\bx_{t-1}|\bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \bSigma_\theta(\bx_t, t))\) for \({1 \lt t \leq T}\).
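With \(p_\theta(\bx_{t-1} | \bx_t) = \mathcal{N}(\bx_{t-1}; \bmu_\theta(\bx_t, t), \sigma_t^2\bI)\), we can write:
\[L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\bmu}_t(\bx_t, \bx_0) - \bmu_\theta(\bx_t, t)\right\|^2\right] + C\]
where \(\tilde{\bmu}_t\) is the mean of the forward process posterior \(q(\bx_{t-1}|\bx_t, \bx_0)\) and \(C\) does not depend on \(\theta\).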
Experimentally, both \(\sigma_t^2 = \beta_t\) and \(\sigma_t^2 = \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t\) gave similar results.
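To get a feel for how far apart the two variance choices are, the following sketch computes \(\tilde\beta_t\) for an assumed linear schedule in plain Python. The helper name is hypothetical, and the convention \(\bar\alpha_0 = 1\) is an assumption made here so that \(\tilde\beta_1\) is defined.

```python
def posterior_variances(betas):
    """Return beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t.

    Assumes the convention alpha_bar_0 = 1, which gives beta_tilde_1 = 0.
    """
    variances, abar_prev = [], 1.0
    for beta in betas:
        abar = abar_prev * (1.0 - beta)
        variances.append((1.0 - abar_prev) / (1.0 - abar) * beta)
        abar_prev = abar
    return variances


T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
beta_tilde = posterior_variances(betas)
# Ratio beta_tilde_t / beta_t for every timestep.
ratios = [bt / b for bt, b in zip(beta_tilde, betas)]
```

Under these assumptions \(\tilde\beta_t\) never exceeds \(\beta_t\); the two differ most in the first few steps and nearly coincide close to \(T\).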
Input image data is assumed to be integers in \(\{0, 1, \ldots, 255\}\) scaled linearly to \([-1, 1]\). The last step of the reverse process is set to an independent discrete decoder. At the final step of sampling, noise is not used.
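The linear scaling can be sketched in a few lines of plain Python. Both function names are hypothetical, and the inverse mapping (round, then clip) is only an assumed way of turning samples back into pixel values, not the discrete decoder itself.

```python
def to_model_range(pixels):
    """Map integer pixel values in {0, ..., 255} linearly to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def to_pixels(values):
    """Map values in [-1, 1] back to integers in {0, ..., 255} (round, then clip)."""
    return [min(255, max(0, round((v + 1.0) * 127.5))) for v in values]


scaled = to_model_range([0, 128, 255])
restored = to_pixels(scaled)
```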
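Then we can simplify the loss to
\[L_{t-1} - C = \mathbb{E}_{\bx_0, \epsilon}\left[\lambda_t\left\|\epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}\bx_0 + \sqrt{1-\bar\alpha_t}\epsilon, t)\right\|^2\right], \qquad \lambda_t = \frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar\alpha_t)}\]
For small \(t\), \(\lambda_t\) is too large. In the paper, setting \(\lambda_t = 1\) improves sample quality.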
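To see how uneven the weighting is, the sketch below evaluates \(\lambda_t\) in plain Python. It assumes the weight from the DDPM paper, \(\lambda_t = \beta_t^2 / (2\sigma_t^2\alpha_t(1-\bar\alpha_t))\), together with \(\sigma_t^2 = \beta_t\) and a linear schedule; the helper name is hypothetical.

```python
def loss_weights(betas):
    """lambda_t = beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)), sigma_t^2 = beta_t.

    With that variance choice the weight reduces to
    beta_t / (2 alpha_t (1 - alpha_bar_t)).
    """
    weights, abar = [], 1.0
    for beta in betas:
        alpha = 1.0 - beta
        abar *= alpha
        weights.append(beta / (2.0 * alpha * (1.0 - abar)))
    return weights


T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
weights = loss_weights(betas)
first_to_last = weights[0] / weights[-1]
```

Under these assumptions the first timestep is weighted tens of times more heavily than the last, which is the imbalance that setting \(\lambda_t = 1\) removes.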
- dmme.DDPMSampler: Wrapper for computing forward and reverse processes, sampling data, and computing loss for DDPM
- Noise schedule: constants increasing linearly from \(10^{-4}\) to \(0.02\)
- dmme.ddpm.UNet: UNet with GroupNorm and Attention, predicts noise from \(x_t\) and \(t\)
- dmme.LitDDPM: LightningModule for training DDPM
Sampler
- class dmme.DDPMSampler(model: Module, timesteps: int)
Wrapper for computing forward and reverse processes, sampling data, and computing loss for DDPM
Paper: https://arxiv.org/abs/2006.11239
Code: https://github.com/hojonathanho/diffusion
- Parameters:
model (nn.Module) – model
timesteps (int) – diffusion timesteps
- forward(x_t, t)
Predicts the noise given \(x_t\) and \(t\)
Applies forward to the internal model
Expects \(x_t\) to have shape \((N, C, H, W)\)
- Parameters:
x_t (torch.Tensor) – image
t (int) – \(t\) in \(\bold{x}_t\)
- forward_process(x_0, t, noise=None)
Forward Diffusion Process
Samples \(\bold{x}_t\) from \(q(\bold{x}_t|\bold{x}_0) = \mathcal{N}(\bold{x}_t;\sqrt{\bar\alpha_t}\bold{x}_0,(1-\bar\alpha_t)\bold{I})\)
Computes \(\bold{x}_t = \sqrt{\bar\alpha_t}\bold{x}_0 + \sqrt{1-\bar\alpha_t}\epsilon\) with \(\epsilon \sim \mathcal{N}(\bold{0}, \bold{I})\)
- Parameters:
x_0 (torch.Tensor) – data to add noise to
t (int) – \(t\) in \(x_t\)
noise (torch.Tensor, optional) – \(\epsilon\), noise used in the forward process
- Returns:
\(\bold{x}_t \sim q(\bold{x}_t|\bold{x}_0)\)
- Return type:
(torch.Tensor)
- noise_schedule()
Noise Schedule for DDPM
DDPM sets \(T = 1000\) and linearly increases \(\beta_t\) from \(10^{-4}\) to \(0.02\)
- Returns:
\(\beta_1, \, ... \, ,\beta_T\) as a tensor of shape \((T,)\)
- Return type:
(torch.Tensor)
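A linear schedule of this kind can be sketched in plain Python as follows. The function name and its arguments are hypothetical, and the actual method returns a tensor rather than a list.

```python
def linear_noise_schedule(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Return beta_1, ..., beta_T spaced evenly from beta_start to beta_end."""
    if timesteps < 2:
        raise ValueError("need at least two timesteps to interpolate")
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


betas = linear_noise_schedule()
```

In PyTorch, `torch.linspace(beta_start, beta_end, timesteps)` would give the same values as a tensor of shape \((T,)\).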
- register_alphas(beta)
Caches the \(\alpha_t\) values used in the forward and reverse processes
The \(\alpha_t\) values are constant, so we register them as buffers of the nn.Module
- Parameters:
beta (torch.Tensor) – beta values to use to compute alphas, a tensor of shape \((T,)\)
- reverse_process(x_t, t, noise=None)
Reverse Denoising Process
Samples \(\bold{x}_{t-1}\) from \(p_\theta(\bold{x}_{t-1}|\bold{x}_t) = \mathcal{N}(\bold{x}_{t-1};\bold\mu_\theta(\bold{x}_t, t), \sigma_t^2\bold{I})\)
\[\begin{aligned} \bold\mu_\theta(\bold{x}_t, t) &= \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) \\ \sigma_t^2 &= \beta_t \end{aligned} \]
Computes \(\bold{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\bigg(\bold{x}_t -\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(\bold{x}_t,t)\bigg) +\sigma_t\epsilon\) with \(\epsilon \sim \mathcal{N}(\bold{0}, \bold{I})\)
- Parameters:
x_t (torch.Tensor) – \(x_t\), the noisy sample at timestep \(t\)
t (int) – current timestep
noise (torch.Tensor, optional) – \(\epsilon\), noise used in the reverse process
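A single reverse step can be sketched on plain Python lists as below. This is an illustration under the assumption \(\sigma_t^2 = \beta_t\), with a hypothetical function that takes the predicted noise as an argument instead of calling a model; it is not the library's implementation.

```python
import math


def reverse_step(x_t, eps_pred, beta_t, alpha_bar_t, noise):
    """Compute x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t) + sigma_t * z."""
    alpha_t = 1.0 - beta_t
    eps_scale = beta_t / math.sqrt(1.0 - alpha_bar_t)
    sigma_t = math.sqrt(beta_t)  # standard deviation for sigma_t^2 = beta_t
    return [
        (x - eps_scale * e) / math.sqrt(alpha_t) + sigma_t * z
        for x, e, z in zip(x_t, eps_pred, noise)
    ]


# Check at t = 1, where alpha_bar_1 = alpha_1: given the true noise and z = 0,
# the step recovers x_0 up to rounding.
beta_1 = 1e-4
x_0 = [0.25, -0.5, 1.0]
eps = [0.3, -1.2, 0.7]
x_1 = [math.sqrt(1.0 - beta_1) * x + math.sqrt(beta_1) * e for x, e in zip(x_0, eps)]
recovered = reverse_step(x_1, eps, beta_1, 1.0 - beta_1, noise=[0.0, 0.0, 0.0])
```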
- compute_loss(x_0, t=None, noise=None)
Computes the loss
\(L_\text{simple} = \mathbb{E}_{\bold{x}_0\sim q(\bold{x}_0), \epsilon\sim\mathcal{N}(\bold{0},\bold{I}), t\sim\mathcal{U}(1,T)} \left[\|\epsilon-\epsilon_\theta(\bold{x}_t, t) \|^2\right]\)
- Parameters:
x_0 (torch.Tensor) – \(x_0\)
t (int, optional) – sampled \(t\)
noise (torch.Tensor, optional) – sampled \(\epsilon\)
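The sketch below draws one Monte Carlo sample of \(L_\text{simple}\) in plain Python. All names are hypothetical, the linear schedule is assumed, and the two toy predictors stand in for the network: one always predicts zero, the other inverts the forward process using the known \(x_0\).

```python
import math
import random


def simple_loss(x_0, eps_model, alpha_bars, rng):
    """One-sample estimate of E[ || eps - eps_theta(x_t, t) ||^2 ]."""
    t = rng.randint(1, len(alpha_bars))       # t ~ U{1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x_0]  # eps ~ N(0, I)
    abar = alpha_bars[t - 1]
    x_t = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x_0, eps)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))


T = 1000
alpha_bars, running = [], 1.0
for i in range(T):
    running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    alpha_bars.append(running)

x_0 = [0.25, -0.5, 1.0]


def zero_model(x_t, t):
    # Baseline that ignores its input.
    return [0.0] * len(x_t)


def oracle_model(x_t, t):
    # Recovers eps exactly because it is allowed to look at x_0.
    abar = alpha_bars[t - 1]
    return [(x - math.sqrt(abar) * x0) / math.sqrt(1.0 - abar) for x, x0 in zip(x_t, x_0)]


zero_loss = simple_loss(x_0, zero_model, alpha_bars, random.Random(0))
oracle_loss = simple_loss(x_0, oracle_model, alpha_bars, random.Random(0))
```

A real training step would average this quantity over a batch and backpropagate through the model.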
- sample(x_t, t, noise=None)
Generate Samples
Iteratively sample from \(p_\theta(x_{t-1}|x_t)\) starting from \(x_T\)
- Parameters:
x_t (Tuple[int, int, int]) – image shape
t (int) – timestep \(t\) to sample from
- Returns:
sample from \(p_\theta(x_{t-1}|x_t)\) starting from \(x_T\)
- Return type:
(torch.Tensor)
Model
- class dmme.ddpm.UNet(in_channels=3, dim=128, pos_dim=128, emb_dim=512, multipliers=(1, 2, 2, 2), attn_depth=(2,), groups=32, dropout=0.1)
UNet with GroupNorm and Attention, predicts noise from \(x_t\) and \(t\)
- Parameters:
in_channels (int) – input image channels
dim (int) – initial dim
pos_dim (int) – sinusoidal position encoding dim
emb_dim (int) – time embedding mlp dim
multipliers (Tuple[int, ...]) – channel multipliers
attn_depth (Tuple[int, ...]) – depth where attention is applied
groups (int) – number of groups in nn.GroupNorm
dropout (float) – dropout in ResBlock
- forward(x, t)
Predicts the noise in \(x_t\) from \(x_t\) and the timestep embedding of \(t\) using a UNet
- Parameters:
x (torch.Tensor) – \(x_t\), tensor of shape \((N, C, H, W)\)
t (int) – \(t\)
- Returns:
\(\epsilon_\theta(x_t,t)\) predicted noise from image, a tensor of shape \((N, C, H, W)\)
- Return type:
(torch.Tensor)
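- dmme.ddpm.unet.TimeStepEmbedding: Timestep embedding network
- dmme.ddpm.unet.SinusoidalPositionEmbeddings: Transformer position embedding
- dmme.ddpm.unet.Block: Convolutional Block with multiple resblocks
- dmme.ddpm.unet.DownSample: Downsampling layer
- dmme.ddpm.unet.UpSample: Upsampling layer
- dmme.ddpm.unet.Attention: Multi Head Self Attention layer
- dmme.ddpm.unet.ResBlock: ResBlock for UNet
- dmme.ddpm.unet.conv2d: convolution layer builder with normalization and activation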
- class dmme.ddpm.unet.TimeStepEmbedding(pos_dim=64, emb_dim=256)
Timestep embedding network
- Parameters:
pos_dim (int) – sinusoidal position encoding dim
emb_dim (int) – time embedding mlp dim
- class dmme.ddpm.unet.SinusoidalPositionEmbeddings(dim)
Transformer position embedding
- Parameters:
dim (int) – dim
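For reference, a Transformer-style sinusoidal embedding of a scalar timestep can be sketched in plain Python as below. The function name, the `max_period` argument, and the layout (all sines followed by all cosines) are assumptions; the class may order or scale the frequencies differently.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Embed timestep t as [sin(t * f_0), ..., sin(t * f_k), cos(t * f_0), ..., cos(t * f_k)]."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometric progression of frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


embedding = sinusoidal_embedding(t=10, dim=8)
```

Each frequency contributes a sine and cosine pair, so nearby timesteps get similar embeddings while distant ones remain distinguishable.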
- class dmme.ddpm.unet.Block(in_channels, out_channels, emb_dim, groups, dropout, num_blocks=2, add_attention=False)
Convolutional Block with multiple resblocks
- Parameters:
in_channels (int) – number of input channels
out_channels (int) – number of output channels
emb_dim (int) – time embedding dim
groups (int) – num groups in nn.GroupNorm
dropout (float) – dropout used in ResBlock
num_blocks (int) – number of ResBlocks used
add_attention (bool) – whether to add attention to the final layer
- class dmme.ddpm.unet.DownSample(dim)
Downsampling layer
- Parameters:
dim (int) – number of input and output channels
- class dmme.ddpm.unet.UpSample(dim, scale_factor)
Upsampling layer
- Parameters:
dim (int) – number of input and output channels
scale_factor (float) – upsample scale
- class dmme.ddpm.unet.Attention(dim, groups=8)
Multi Head Self Attention layer
- Parameters:
dim (int) – \(d_\text{model}\)
groups (int) – num groups in nn.GroupNorm
- class dmme.ddpm.unet.ResBlock(in_channels, out_channels, emb_dim, groups=8, dropout=0.0)
ResBlock for UNet
- Parameters:
in_channels (int) – number of input channels
out_channels (int) – number of output channels
emb_dim (int) – timestep embedding dim
groups (int) – num groups in nn.GroupNorm
dropout (float) – dropout applied in each conv
- forward(x, t)
ResBlock with time embeddings
ResBlock with two convolution layers and a residual connection. The time embedding is projected by an MLP to match dimensions and added to the first layer’s output, then normalization, activation, and dropout are applied in that order. The second convolutional layer is identical to that of a basic ResBlock.
- Parameters:
x (torch.Tensor) – \(x_t\), tensor of shape \((N, C, H, W)\)
t (int) – \(t\)
- Returns:
tensor of shape \((N, C, H, W)\) where \(C\) is out_channels
- Return type:
(torch.Tensor)
- property norm
Returns copies of normalization layers
- property act
Returns copies of activation layers
- dmme.ddpm.unet.conv2d(in_channels, out_channels, kernel_size, padding, norm=None, act=None)
convolution layer builder with normalization and activation
- Parameters:
in_channels (int) – number of input channels
out_channels (int) – number of output channels
kernel_size (int) – kernel size
padding (int) – padding
norm (nn.Module) – normalization layer instance
act (nn.Module) – activation function instance
Training
- class dmme.LitDDPM(sampler: Optional[Module] = None, lr: float = 0.0002, warmup: int = 5000, imgsize: Tuple[int, int, int] = (3, 32, 32), timesteps: int = 1000, decay: float = 0.9999)
LightningModule for training DDPM
- Parameters:
sampler (nn.Module) – an instance of DDPMSampler
lr (float) – learning rate, defaults to \(2 \times 10^{-4}\)
warmup (int) – linearly increases learning rate for warmup steps until lr is reached, defaults to 5000
imgsize (Tuple[int, int, int]) – image size in (C, H, W)
timesteps (int) – total timesteps for the forward and reverse process, \(T\)
decay (float) – EMA decay value
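The `warmup` and `decay` parameters can be read as follows. This is a minimal sketch in plain Python with hypothetical helper names; it assumes a linear ramp from zero and a standard exponential moving average, which may differ in detail from what LitDDPM does internally.

```python
def warmup_lr(step, base_lr=2e-4, warmup=5000):
    """Learning rate that rises linearly to base_lr over `warmup` steps, then stays flat."""
    return base_lr * min(1.0, step / warmup)


def ema_update(average, value, decay=0.9999):
    """One exponential moving average step: keep `decay` of the old average."""
    return decay * average + (1.0 - decay) * value


halfway_lr = warmup_lr(2500)
final_lr = warmup_lr(50000)
averaged = ema_update(average=1.0, value=0.0)
```

With `decay = 0.9999`, each update moves the averaged weights only 0.01% of the way toward the current weights, so the average reflects roughly the last ten thousand steps.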
- forward(x_t, start_t, stop_t=0, step_t=-1, noise=None)
Iteratively samples from \(p_\theta(x_{t-1}|x_t)\) starting from \(x_t\), over the timestep sequence given by start_t, stop_t and step_t
- Parameters:
x_t (torch.Tensor) – image of shape \((N, C, H, W)\)
start_t (int) – starting \(t\) to sample from
stop_t (int) – stops sampling when reached
step_t (int) – step size of the timestep sequence
noise (torch.Tensor) – noise to use for sampling, if None samples new noise
- Returns:
generated samples
- Return type:
(torch.Tensor)
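One way to picture the timestep arguments is as a Python `range`. The helper below is hypothetical and assumes that `stop_t` is exclusive, which matches "stops sampling when reached" but is not confirmed by the documentation above.

```python
def timestep_sequence(start_t, stop_t=0, step_t=-1):
    """Timesteps visited when counting from start_t towards stop_t (exclusive)."""
    return list(range(start_t, stop_t, step_t))


full_chain = timestep_sequence(5)
strided = timestep_sequence(1000, 0, -250)
```

Under that assumption the defaults visit every timestep from `start_t` down to 1, and a larger negative `step_t` skips timesteps.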