Fast Timing-Conditioned Latent Audio Diffusion | Stability.AI

Jack Tol

2024-05-04

Section 1 - Introduction

1.1 - The Motivation

Generative AI has made remarkable progress across several modalities:

  • Text
  • Images
  • Video
  • Audio

1.2 - Diffusion Models - What are they?

  • A diffusion model is a type of generative machine learning model that gradually constructs data (images, video, audio) by starting from a noisy distribution and progressively denoising it to produce complex samples.
  • The model works by first adding noise to data over several steps to create a simple noise distribution, then learning to reverse this process to reconstruct the original data from the noise.
  • During training, the model learns to predict the noise added to the data, rather than directly predicting the data itself.
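
To make the noise-prediction idea concrete, here is a minimal, generic DDPM-style training step (a sketch with a simple linear schedule; Stable Audio itself uses the v-objective and cosine schedule described in Section 3.4):

```python
import torch

def diffusion_training_step(model, x0, num_steps=1000):
    """One noise-prediction training step (generic DDPM-style sketch).

    model : network that predicts the noise added to a sample
    x0    : a clean training batch, e.g. latents of shape (B, C, T)
    """
    # Sample a random diffusion timestep per example.
    t = torch.randint(0, num_steps, (x0.shape[0],), device=x0.device)

    # Simple linear noise schedule: alpha_bar decays from ~1 to ~0.
    alpha_bar = 1.0 - (t.float() + 1) / num_steps
    alpha_bar = alpha_bar.view(-1, *([1] * (x0.dim() - 1)))

    # Forward process: corrupt the clean data with Gaussian noise.
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise

    # The model is trained to predict the noise, not the data itself.
    pred_noise = model(x_t, t)
    return torch.nn.functional.mse_loss(pred_noise, noise)
```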

1.3 - Diffusion Models - Conditioning

  • Conditioning a diffusion model involves providing additional input, like text or labels, to guide the generation process towards a desired outcome.
  • This conditioning allows for precise control over the features of the generated output, such as style, content, or specific attributes.
  • To achieve this, modifications are made to the model’s architecture, training process, and input data to integrate and utilize the conditional data effectively.

1.4 - Diffusion Models - Applications

Diffusion models are applied to tasks such as:

  • Denoising
  • Inpainting
  • Super-resolution
  • Image, Video, & Audio Generation

1.5 - Diffusion Models - The Type of Data We Use

Section 2 - Stable Audio Architecture

2.1 - Stable Audio Architecture

At the highest level, Stable Audio is a Latent Diffusion Model consisting of:

  • Variational Autoencoder
  • Conditioning Signals
  • Diffusion Model

2.2 - Stable Audio Diagram

2.3 - Variational Autoencoder

  • Faster Generation & Training Time
  • Improved Audio Reconstruction at High Compression Ratios
  • Trained From Scratch on Custom Dataset
  • Downsamples Stereo Audio by a Factor of 1024; the Latent has 64 Channels
  • Overall Compression Ratio of 32:1
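
A quick sanity check of the 32:1 figure, assuming the 64 latent channels describe each downsampled frame of the 2-channel input:

```python
# One latent frame summarizes 1024 samples of each of the 2 stereo channels.
input_values_per_latent_frame = 2 * 1024
latent_values_per_frame = 64

compression_ratio = input_values_per_latent_frame / latent_values_per_frame
print(compression_ratio)  # 32.0 -> the quoted overall compression ratio of 32:1
```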

2.3.1 - Variational Autoencoder Simple Diagram

2.3.2 - Variational Autoencoder Complex Diagram


  • Discriminator → Adversarial & Feature Matching Losses using STFT Loss
  • Each loss is weighted as follows (see the combined-loss sketch after this list):
    • 1.0 for spectral losses
    • 0.1 for adversarial losses
    • 5.0 for the feature matching loss
    • 1e-4 for the KL loss
  • R0, R1, R2 → Residual Connections
  • ⊕ → Summation Operator
  • D1, Dn → Dilated Convolution
  • Snake → Snake Activations
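
A minimal sketch of how those weights could combine into a single autoencoder objective (the four loss terms here are placeholders for the actual spectral, adversarial, feature-matching, and KL computations):

```python
def vae_total_loss(spectral, adversarial, feature_matching, kl):
    """Weighted sum of the autoencoder losses using the weights listed above."""
    return 1.0 * spectral + 0.1 * adversarial + 5.0 * feature_matching + 1e-4 * kl
```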

2.4 - Conditioning | CLAP Text Encoder

Acronyms

  • CLAP - Contrastive Language-Audio Pretraining
  • BERT - Bidirectional Encoder Representations from Transformers
  • HTSAT - Hierarchical Token Semantic Audio Transformer
  • RoBERTa - A Robustly Optimized BERT Pretraining Approach

2.4.1 - Conditioning | CLAP Text Encoder Diagram

2.4.2 - Conditioning | CLAP Text Encoder Cont.

  • Trained From Scratch on Custom Dataset
  • Both encoders trained using Language-Audio Contrastive Loss
  • CLAP is used due to its native multimodality between words & audio
  • Stable Audio’s CLAP implementation can outperform the open-source CLAP and T5 embedding models
  • Text features are taken from the next-to-last layer, which provides a better conditioning signal than the final layer’s features
  • These text features are provided to the diffusion U-Net through cross-attention layers
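
A hedged sketch of the text-conditioning path: taking features from the next-to-last encoder layer rather than the final one. The `output_hidden_states` interface is an assumption (Hugging Face-style), not necessarily Stable Audio's exact code:

```python
import torch

def clap_text_conditioning(text_encoder, tokens):
    """Sketch: take text features from the next-to-last transformer layer.

    `text_encoder` is assumed to return all hidden states when asked
    (Hugging Face-style encoders do with output_hidden_states=True);
    the exact Stable Audio interface may differ.
    """
    outputs = text_encoder(tokens, output_hidden_states=True)
    # hidden_states[-1] is the final layer; [-2] is the next-to-last layer,
    # which the paper reports gives a better conditioning signal.
    text_features = outputs.hidden_states[-2]  # (B, seq_len, d_model)
    return text_features  # later consumed by the U-Net's cross-attention layers
```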

2.5 - Conditioning | Timing Embeddings

  • Two Properties are Calculated when Gathering a Chunk of Audio from the Dataset:
    • The second from which the chunk starts, denoted as seconds_start
    • The number of seconds in the original audio file, denoted as seconds_total

These values are:

  • Translated into per-second discrete learned embeddings
  • Concatenated along the sequence dimension with text features derived from the prompt, providing a comprehensive context for processing
  • And then passed into the U-Net’s cross-attention layers

During inference, seconds_start and seconds_total serve as conditioning variables, enabling users to generate variable-length outputs.

Audio files shorter than the training window are padded with silence up to the training window length.

Suppose we take a 95-sec chunk from a 180-sec audio file, with the chunk starting 14 sec in. Then:

  • seconds_start = 14
  • seconds_total = 180
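
A minimal sketch of how these two values could become conditioning tokens alongside the prompt features; the class name, embedding sizes, and maximum length are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Turn seconds_start / seconds_total into discrete learned embeddings."""

    def __init__(self, max_seconds=512, d_model=768):
        super().__init__()
        self.start_embed = nn.Embedding(max_seconds, d_model)
        self.total_embed = nn.Embedding(max_seconds, d_model)

    def forward(self, seconds_start, seconds_total, text_features):
        # Each value maps to one discrete learned embedding (one token each).
        start_tok = self.start_embed(seconds_start).unsqueeze(1)  # (B, 1, d)
        total_tok = self.total_embed(seconds_total).unsqueeze(1)  # (B, 1, d)
        # Concatenate along the sequence dimension with the prompt features,
        # then hand everything to the U-Net's cross-attention layers.
        return torch.cat([text_features, start_tok, total_tok], dim=1)

# Example from above: a 95-sec chunk of a 180-sec file, starting 14 sec in.
cond = TimingConditioner()
prompt_features = torch.zeros(1, 77, 768)  # stand-in for real text features
tokens = cond(torch.tensor([14]), torch.tensor([180]), prompt_features)
print(tokens.shape)  # torch.Size([1, 79, 768])
```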

2.5.1 - Conditioning | Timing Embeddings Diagram

2.6 - Diffusion Model Diagram

2.6.1 - Diffusion Model Cont.

  • Based on a 907M-parameter U-Net
  • 4 symmetrical levels of downsampling encoder blocks and upsampling decoder blocks
  • Skip Connections added between encoder and decoder blocks
  • The 4 levels have channel counts of 1024, 1024, 1024, and 1280, and downsample by factors of 1, 2, 2, and 4 respectively.
  • Each encoder & decoder block has 3 attention layers
  • The diffusion timestep conditioning is passed in through FiLM layers to modulate the model activations based on the noise level (see the sketch after this list).
  • Prompt & Timing Conditioning → Model through Cross Attention Layers

Section 3 - Training

3.1 - Dataset

806,284 Audio Samples totaling over 19,500 hours, consisting of:

  • Music - 66% by Number of Files | 94% by Storage in GB
  • Sound Effects - 25% by Number of Files | 5% by Storage in GB
  • Instrument Stems - 9% by Number of Files | 1% by Storage in GB

With corresponding text metadata from AudioSparx.

3.2 - Variational Autoencoder

  • Uses Automatic Mixed Precision for 1.1M steps
  • Effective batch size of 256
  • 16 A100 GPUs
  • After 460,000 steps the encoder was frozen and the decoder was fine-tuned for an additional 640,000 steps
  • Utilized multi-resolution sum and difference STFT loss for stereo signals with A-weighting before STFT, using window lengths of 2048, 1024, 512, 256, 128, 64, and 32. Weighted losses: 1.0 spectral, 0.1 adversarial, 5.0 feature matching, 1e-4 KL.
  • Discriminators with STFT window lengths of 2048, 1024, 512, 256, and 128, use complex STFT representation and patch-based hinge loss.
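
A simplified sketch of the sum-and-difference multi-resolution STFT loss described above (the hop length is a common choice rather than the paper's exact value, and the A-weighting filter applied before the STFT is omitted for brevity):

```python
import torch

def stft_mag(x, win_length):
    """Magnitude STFT with hop = win_length // 4 (a common choice; assumption)."""
    window = torch.hann_window(win_length, device=x.device)
    spec = torch.stft(x, n_fft=win_length, hop_length=win_length // 4,
                      win_length=win_length, window=window, return_complex=True)
    return spec.abs()

def sum_and_difference_stft_loss(pred, target,
                                 win_lengths=(2048, 1024, 512, 256, 128, 64, 32)):
    """Multi-resolution STFT loss on the sum (mid) and difference (side)
    channels of a stereo signal. pred/target: (B, 2, T)."""
    # Mid/side transform of the stereo signal.
    pairs = [(pred[:, 0] + pred[:, 1], target[:, 0] + target[:, 1]),
             (pred[:, 0] - pred[:, 1], target[:, 0] - target[:, 1])]
    loss = 0.0
    for p, t in pairs:
        for w in win_lengths:
            loss = loss + (stft_mag(p, w) - stft_mag(t, w)).abs().mean()
    return loss
```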

3.3 - Text Encoder

  • CLAP Model trained for 100 epochs from scratch on the custom dataset
  • Effective batch size of 6,144
  • 64 A100 GPUs
  • Followed the training setup of the CLAP authors
  • Language-Audio Contrastive Loss

3.4 - Diffusion Model

  • Exponential Moving Average & Automatic Mixed Precision for 640,000 steps
  • 64 A100 GPUs
  • Effective batch size of 256
  • Resampled to 44.1kHz & sliced to 4,194,304 samples (95.1 sec)
  • Longer files cropped, shorter files padded
  • V-Objective with Cosine Noise Scheduler
  • 10% Dropout on Conditioning Signals to be able to use Classifier-Free Guidance
  • Text Encoder Frozen While Training the Diffusion Model
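
Putting the last few items together, a sketch of one diffusion training step with the v-objective, a cosine noise schedule, and 10% conditioning dropout (standard formulations of these techniques; the exact Stable Audio code may differ):

```python
import math
import torch

def v_objective_training_step(model, x0, cond, p_cond_dropout=0.1):
    """One training step with the v-objective and a cosine noise schedule.

    x0   : clean latents, shape (B, C, T)
    cond : conditioning tensor (text + timing features), shape (B, N, D)
    """
    b = x0.shape[0]

    # Continuous time in [0, 1]; cosine schedule: alpha = cos, sigma = sin.
    t = torch.rand(b, device=x0.device)
    phi = t * math.pi / 2
    alpha = torch.cos(phi).view(b, 1, 1)
    sigma = torch.sin(phi).view(b, 1, 1)

    # Forward (noising) process and the v-target.
    eps = torch.randn_like(x0)
    x_t = alpha * x0 + sigma * eps
    v_target = alpha * eps - sigma * x0

    # Drop the conditioning 10% of the time so the model also learns the
    # unconditional distribution, enabling classifier-free guidance later.
    drop = (torch.rand(b, device=x0.device) < p_cond_dropout).view(b, 1, 1)
    cond = torch.where(drop, torch.zeros_like(cond), cond)

    v_pred = model(x_t, t, cond)
    return torch.nn.functional.mse_loss(v_pred, v_target)
```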

Section 4 - Methodology

4.1 - Quantitative Metrics | \(FD_{Openl3}\)

  • The Fréchet Distance (FD) measures the similarity between the statistics of generated and reference audio sets in a feature space.
  • A low FD suggests that the generated audio closely matches the reference audio.
  • Unlike the older VGGish method, which operates at 16 kHz, the \(Openl3\) feature space is used as it accepts signals up to 48 kHz, enhancing the ability to analyze high-quality audio.
  • FD extended to evaluate stereo signals by independently processing left and right channels in the \(Openl3\) feature space, then concatenating them to create stereo features.
  • This novel \(FD_{Openl3}\) metric allows more accurate assessment of high-quality, stereo audio content and is the metric adopted here (the underlying Fréchet distance is given below).
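
For reference, the Fréchet distance between the Gaussian statistics (mean \(\mu\) and covariance \(\Sigma\)) of the generated and reference feature sets is the standard:

\[
FD = \lVert \mu_{gen} - \mu_{ref} \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_{gen} + \Sigma_{ref} - 2\left(\Sigma_{gen}\Sigma_{ref}\right)^{1/2}\right)
\]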

4.2 - Quantitative Metrics | \(KL_{PaSST}\)

  • PaSST is a state-of-the-art audio tagger trained on AudioSet; it is used to compute the KL divergence over label probabilities between:
    • Generated Audio
    • Reference Audio
  • The generated audio is expected to share similar semantics (tags) with the reference audio when the KL is low.
  • The \(KL_{PaSST}\) is adapted to evaluate audio of varying lengths:
    • Segmenting the audio into overlapping analysis windows
    • Calculating the mean (across windows) of the generated logits and then applying a softmax
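
A sketch of that windowed computation; `tagger` stands in for a PaSST-style model returning per-label logits, and the window/hop sizes are illustrative rather than the paper's values:

```python
import torch
import torch.nn.functional as F

def passt_style_kl(tagger, generated, reference, win=320000, hop=160000):
    """KL divergence between tag distributions of generated and reference
    audio, computed over overlapping analysis windows."""
    def mean_probs(audio):
        # Segment into overlapping analysis windows.
        windows = [audio[..., i:i + win]
                   for i in range(0, max(audio.shape[-1] - win, 0) + 1, hop)]
        logits = torch.stack([tagger(w) for w in windows])  # (num_win, num_labels)
        # Mean of the logits across windows, then softmax (as described above).
        return F.softmax(logits.mean(dim=0), dim=-1)

    p_ref = mean_probs(reference)
    p_gen = mean_probs(generated)
    # KL(reference || generated): low when the generated audio shares the
    # reference audio's semantics (tags).
    return F.kl_div(p_gen.log(), p_ref, reduction="sum")
```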

4.3 - Quantitative Metrics | \(CLAP_{score}\)

  • The cosine similarity is computed between the \(CLAP_{LAION}\) text embedding of the given text prompt and the \(CLAP_{LAION}\) audio embedding of the generated audio.
  • A high \(CLAP_{score}\) denotes that the generated audio adheres to the given text prompt.
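
The score itself is just a cosine similarity; obtaining the actual \(CLAP_{LAION}\) embeddings is outside this sketch:

```python
import torch
import torch.nn.functional as F

def clap_score(text_embedding, audio_embedding):
    """Cosine similarity between the CLAP text embedding of the prompt and
    the CLAP audio embedding of the generated audio."""
    return F.cosine_similarity(text_embedding, audio_embedding, dim=-1)

# Hypothetical example with random vectors standing in for real embeddings.
score = clap_score(torch.randn(1, 512), torch.randn(1, 512))
```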

4.4 - Qualitative Metrics

  • Audio Quality - Evaluates whether the generated audio is of low-fidelity with artifacts or high-fidelity.
  • Text Alignment - Evaluates how the generated audio adheres to the given text prompt.
  • Musicality (music only) - Evaluates the capacity of the model to articulate melodies and harmonies.
  • Stereo Correctness (stereo only) - Evaluates the appropriateness of the generated spatial image.
  • Musical Structure (music only) - Evaluates if the generated song contains intro, development, and/or outro.

Human Ratings Collected:

  • 0 | Bad
  • 1 | Poor
  • 2 | Fair
  • 3 | Good
  • 4 | Excellent

4.5 - Evaluation Data | Experiments

  • MusicCaps & AudioCaps standard benchmarks are used.
  • MusicCaps contains 5,521 music segments from YouTube, each with 1 caption.
  • AudioCaps contains 979 audio segments from YouTube, each with several captions.
  • For each model tested, one audio clip is generated per caption (5,521 for MusicCaps & 4,875 for AudioCaps).
  • MusicCaps & AudioCaps captions describe only 10-sec segments of audio, so full-length generations were also evaluated.
  • However, those results are less consistent because the captions only accurately describe the 10-sec segments, while the generated audio ranges from 10 sec to 95 sec.

4.6 - Evaluation Data | Qualitative Experiments

  • Prompts were randomly picked from MusicCaps and AudioCaps.
  • Avoided captions which included “low quality” or a similar phrase to focus on high-fidelity synthesis.
  • Avoided ambient music because users found it challenging to evaluate musicality.
  • Avoided speech-related prompts since speech synthesis was out of scope.

4.7 - Evaluation Data | Quantitative Results on MusicCaps

4.8 - Evaluation Data | Quantitative Results on AudioCaps

4.9 - Evaluation Data | Qualitative Results on Both MusicCaps & AudioCaps

Section 5 - Experiments

5.1 - How does the autoencoder impact audio fidelity?

Stable Audio VAE Reconstruction Demo (MusicCaps & AudioCaps)

5.2 - How accurate is the timing conditioning?

  • The model consistently produces audio matching the expected lengths.
  • Increased errors occur in the 40-60 second range, where such lengths are less represented in the training data.
  • Audio length is measured using a basic energy threshold to detect silence.
  • Shortest lengths may be inaccurately short due to limitations in the silence detection method.
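
A sketch of such an energy-threshold length measurement; the frame size and threshold here are illustrative, not the paper's values:

```python
import torch

def measured_length_seconds(audio, sr=44100, frame=2048, threshold=1e-3):
    """Estimate the length of the non-silent portion of a generated clip
    with a basic RMS-energy threshold. audio: (channels, samples)."""
    mono = audio.mean(dim=0)
    # RMS energy per frame.
    frames = mono[: (mono.shape[0] // frame) * frame].reshape(-1, frame)
    rms = frames.pow(2).mean(dim=1).sqrt()
    active = (rms > threshold).nonzero()
    if active.numel() == 0:
        return 0.0
    # Length up to the last frame whose energy exceeds the threshold.
    return float((active.max().item() + 1) * frame) / sr
```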

Section 6 - Conclusion

Conclusion

  • Stable Audio enables the rapid generation of variable-length, long-form stereo music and sounds at 44.1kHz from text and timing inputs.
  • Novel qualitative and quantitative metrics were introduced for evaluating long-form, full-band stereo signals; under these, Stable Audio is among the top performers, if not the top performer, on two public benchmarks.
  • Unlike other state-of-the-art models, Stable Audio can generate music with structure as well as stereo sound effects.