Audio Transformers

A spectrogram is a 2D tensor of shape (frequency, sequence length), so it can be treated much like an image. Passing it to a CNN classifier can already give good predictions. The Audio Spectrogram Transformer (AST) uses self-attention layers instead, so the model is better able to capture global context. AST splits the audio spectrogram into a sequence of partially overlapping image patches of 16x16 pixels. These patches are projected into an embedding space and given to the transformer encoder as input. AST is an encoder-only transformer.
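The patch-splitting step can be sketched in plain Python. This is only a rough illustration, not the AST implementation: the 128x1024 spectrogram size and the stride of 10 (an overlap of 6 between neighbouring patches) are assumptions, and `extract_patches` is a hypothetical helper.

```python
def extract_patches(spec, patch=16, stride=10):
    """Split a 2D spectrogram (list of rows) into flattened, overlapping patches."""
    n_freq, n_time = len(spec), len(spec[0])
    patches = []
    for r in range(0, n_freq - patch + 1, stride):
        for c in range(0, n_time - patch + 1, stride):
            # Flatten the patch row by row into one vector of patch * patch values.
            flat = [spec[r + i][c + j] for i in range(patch) for j in range(patch)]
            patches.append(flat)
    return patches


# Dummy spectrogram: 128 frequency bins x 1024 time frames (assumed sizes).
spec = [[0.0] * 1024 for _ in range(128)]
patches = extract_patches(spec)
print(len(patches), len(patches[0]))
```

With these assumed sizes the sketch yields 12 x 101 = 1212 patches of 256 values each; every patch vector would then be linearly projected into the embedding space before entering the encoder.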

If the input is a one-second audio file, the ASR model first downsamples the audio using a CNN feature encoder to a shorter sequence of hidden states, with one hidden-state vector for every 20 ms of audio. For one second of audio, we therefore forward 50 hidden states to the transformer encoder. The extracted audio segments partially overlap: even though one hidden state is emitted every 20 ms, each hidden state actually represents 25 ms of audio. The transformer encoder predicts one feature representation for each hidden state. Each output has a dimensionality of 768, so the output shape is (768, 50). Each of these predictions covers 25 ms, which is shorter than the duration of a phoneme, so a single character can be spread over several consecutive outputs.
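The frame arithmetic can be checked with a small sketch. It assumes an idealised encoder with a 20 ms hop and a 25 ms window; a real feature encoder without padding may emit a frame fewer at the edges. `hidden_state_windows` is a hypothetical helper.

```python
def hidden_state_windows(duration_ms, hop_ms=20, window_ms=25):
    """Return the (start_ms, end_ms) span of audio covered by each hidden state."""
    n_states = duration_ms // hop_ms  # one hidden state per hop
    return [(i * hop_ms, i * hop_ms + window_ms) for i in range(n_states)]


windows = hidden_state_windows(1000)  # one second of audio
# Overlap between neighbouring windows: end of the first minus start of the second.
overlap_ms = windows[0][1] - windows[1][0]
print(len(windows), windows[:3], overlap_ms)
```

Under these assumptions, neighbouring windows share 25 - 20 = 5 ms of audio.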

B_R_II_O_N_||_S_AWW_|||||_S_OMEE_TH_ING_||_C_L_O_S_E||TO|_P_A_N_I_C_||_ON||HHI_S||_OP_P_O_N_EN_T_'SS||_F_AA_C_E||_W_H_EN||THE||M_A_NN_||||_F_I_N_AL_LL_Y||||_RREE_C_O_GG_NN_II_Z_ED|||HHISS|||_ER_RRR_ORR||||

_ER_RRR_ORR
_ER_R_OR
ERROR

BRION SAW SOMETHING CLOSE TO PANIC ON HIS OPPONENT'S FACE WHEN THE MAN FINALLY RECOGNIZED HIS ERROR
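The collapsing shown above can be sketched as a small function. It assumes that `_` is the CTC blank token and `|` is the word separator, which is how the example output reads; `ctc_collapse` is a hypothetical helper, not a library function.

```python
from itertools import groupby


def ctc_collapse(raw, blank="_", word_sep="|"):
    """Greedy CTC post-processing: merge repeats, drop blanks, split words."""
    # 1) Merge runs of identical consecutive symbols ("RRR" -> "R").
    merged = [symbol for symbol, _run in groupby(raw)]
    # 2) Drop the blank token, which only separates genuine repeats.
    chars = [symbol for symbol in merged if symbol != blank]
    # 3) Turn word separators into spaces and normalise whitespace.
    return " ".join("".join(chars).replace(word_sep, " ").split())


print(ctc_collapse("_ER_RRR_ORR"))
```

The order matters: repeats are merged before blanks are removed, which is why the blank between `R_R` preserves the double R in ERROR.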

In a seq2seq ASR model, the encoder takes a log-mel spectrogram as input and encodes it into a sequence of encoder hidden states. The encoder output is passed to the transformer decoder through a mechanism called cross-attention, which works like self-attention but attends over the encoder output. The decoder predicts a sequence of text tokens autoregressively, one token at a time. At each timestep, the previously generated output sequence is fed back into the decoder as input, until it predicts the end token. The decoder's attention is causal: the decoder isn't allowed to look into the future.
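A toy sketch of the two ideas in this paragraph, the causal mask and the autoregressive loop. Nothing here is a real model: `toy_next_token` is a hypothetical stand-in for a decoder step, which in practice would also attend to the encoder hidden states through cross-attention.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def greedy_decode(next_token, bos, eos, max_len=20):
    """Feed the growing output back in until the end token (or max_len) is reached."""
    tokens = [bos]
    while len(tokens) < max_len:
        token = next_token(tokens)  # the decoder only sees the tokens generated so far
        tokens.append(token)
        if token == eos:
            break
    return tokens


# Hypothetical decoder step that replays a fixed transcript, for illustration only.
TARGET = ["<bos>", "hello", "world", "<eos>"]


def toy_next_token(prefix):
    return TARGET[len(prefix)]


print(causal_mask(3))
print(greedy_decode(toy_next_token, "<bos>", "<eos>"))
```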

Whisper is trained on 680,000 hours of labelled audio data. It is inherently designed to work with 30-second samples: anything shorter than 30 s is padded to 30 s with silence, and anything longer than 30 s is truncated to 30 s. Memory in a transformer scales with the square of the sequence length, so passing very long audio files leads to an out-of-memory (OOM) error. Long-form transcription is handled by chunking the input audio into smaller, manageable segments, where each segment overlaps slightly with the previous one. Once all the chunks have been transcribed, the segments are stitched back together at the boundaries. The order in which the chunks are transcribed doesn't matter, because each chunk is processed independently (the process is stateless). Chunk lengths are measured in seconds.
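The chunking step can be sketched as a computation of chunk boundaries in seconds. This is only an outline of the idea: `chunk_spans` is a hypothetical helper, the 5-second overlap is an arbitrary assumption, and the stitching of transcripts at the boundaries is not shown.

```python
def chunk_spans(total_s, chunk_s=30.0, overlap_s=5.0):
    """Return (start_s, end_s) spans that cover the audio with overlapping chunks."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap must be non-negative and shorter than the chunk")
    if total_s <= 0:
        return []
    step = chunk_s - overlap_s  # how far each new chunk advances
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start += step
    return spans


# A 70-second recording split into 30-second chunks with 5 seconds of overlap.
print(chunk_spans(70.0))
```

Because each span is independent of the others, the chunks could be transcribed in any order, or batched together, before stitching.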

Massive Multilingual Speech (MMS) is a pretrained model for languages other than English. It can synthesize speech in over 1,100 languages. MMS text-to-speech is based on VITS, a speech generation network that converts text to raw speech waveforms. First, acoustic features are generated; the waveform is then decoded using transposed convolutional layers. During inference, the text encodings are upsampled and transformed into waveforms, so there is no need for a separate vocoder.
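To illustrate how transposed convolutions upsample a short feature sequence into a longer one, here is a toy 1D version in plain Python. It is not the VITS decoder: the kernel, the stride, and the `conv_transpose_1d` helper are assumptions chosen only to show the shape change.

```python
def conv_transpose_1d(x, kernel, stride):
    """Toy 1D transposed convolution: each input value 'stamps' a scaled kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            # Input position i writes into the output starting at i * stride.
            out[i * stride + j] += xi * kj
    return out


features = [1.0, 2.0, 3.0]  # three low-rate feature values
upsampled = conv_transpose_1d(features, kernel=[1.0, 1.0], stride=2)
print(len(features), "->", len(upsampled))
```

The output length is (len(x) - 1) * stride + len(kernel), so stacking several such layers turns a low-rate feature sequence into a much longer, waveform-rate signal.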

Audio Transformers

Introduction

Audio data basics

Representing Audio data

Types of Audio related models

Audio classification Architectures

classification types

Spectrogram based classification

CTC (Connectionist Temporal Classification) Loss

CTC Architectures

CTC algorithm

Transformer architectures for audio

Model inputs

Model outputs

What's the idea behind STFT?

What does a Vocoder do?

Seq2Seq architectures

Automatic speech recognition

Text-to-speech

Long form transcriptions

SpeechT5

Speaker embeddings

HiFi-GAN

BARK

Massive Multilingual Speech (MMS)

Speech-to-speech translation

Transcribe a meeting