Audio Transformers
Introduction
Audio data basics
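sound is continuous data: it contains an infinite number of signal values in any given time span. to store it digitally it is converted into a series of discrete values; .wav, .flac and .mp3 are file formats that hold this digital representation of the audio signal.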
phonemes: smallest units of sound that distinguish one word from another. e.g. English has about 44 phonemes, Japanese has around 22. many ASR models transcribe speech into phonemes before converting them into words.
sampling rate is the number of samples taken in one second, measured in hertz (Hz). sampling speech at 16 kHz is sufficient for human understanding.
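a quick sanity check of what a sampling rate implies: the sketch below computes how many samples a clip contains and the highest frequency that can be captured (half the sampling rate, the Nyquist limit). the function names and the 16 kHz example are illustrative assumptions, not part of any library.

```python
def num_samples(duration_s, sampling_rate_hz):
    """Number of discrete samples in a clip of the given duration."""
    return int(duration_s * sampling_rate_hz)


def nyquist_hz(sampling_rate_hz):
    """Highest frequency that can be represented at this sampling rate."""
    return sampling_rate_hz / 2


print(num_samples(1, 16_000))  # samples in one second of 16 kHz audio
print(nyquist_hz(16_000))      # highest representable frequency in Hz
```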
Amplitude represents the sound pressure level and is measured in decibels (dB). Bit depth defines how many discrete amplitude values are available in a digital audio file. Higher bit depth allows finer resolution, reducing quantization noise and improving dynamic range.
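to make the bit depth statement concrete, here is a small sketch that counts the available amplitude levels and estimates the theoretical dynamic range (roughly 6 dB per bit). the function names are made up for illustration.

```python
import math


def quantization_levels(bit_depth):
    """Number of discrete amplitude values available."""
    return 2 ** bit_depth


def dynamic_range_db(bit_depth):
    """Ratio between the largest and smallest amplitude step, in dB."""
    return 20 * math.log10(2 ** bit_depth)


for bits in (8, 16, 24):
    print(bits, quantization_levels(bits), round(dynamic_range_db(bits), 1))
```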
Representing Audio data
waveform: plots sample values against time. librosa is a popular Python library for loading and plotting audio.

frequency spectrum: plots the amplitude of each frequency component present in a segment of the audio signal (computed with the discrete Fourier transform).

spectrogram: plots the frequency content of audio as it changes over time; allows you to see time, frequency and amplitude all on one graph.
mel spectrogram: approximates the non-linear frequency response of the human ear. it compresses high frequencies while maintaining detail in the lower frequencies, since humans can distinguish low frequencies better than high frequencies.
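the compression of high frequencies can be seen directly in the Hz-to-mel mapping. the sketch below uses one common formula (the HTK-style variant, an assumption here; libraries offer other variants) and compares an equal 500 Hz step at low and high frequencies.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel mapping (one of several conventions in use)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# the same 500 Hz step covers many more mels at the low end
low_step = hz_to_mel(600) - hz_to_mel(100)
high_step = hz_to_mel(8000) - hz_to_mel(7500)
print(round(low_step, 1), round(high_step, 1))
```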

Types of Audio related models
Audio classification Architectures
classification types
Spectrogram based classification
a spectrogram is a 2D tensor of shape (frequencies, sequence length), so it's just like an image. passing it to a CNN image classifier can give very good predictions. Audio Spectrogram Transformer (AST) uses self-attention layers instead, so the model is better able to capture global context. AST splits the audio spectrogram into a sequence of partially overlapping image patches of 16x16 pixels. these patches are projected into an embedding space and given to the transformer encoder as input. AST is an encoder-only transformer.
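to get a feel for how long the patch sequence gets, the sketch below counts overlapping 16x16 patches over a spectrogram. the stride of 10 and the 128 x 1024 spectrogram size are assumptions picked for illustration; the notes above only fix the patch size.

```python
def num_patches(size, patch=16, stride=10):
    """Patches that fit along one axis with the given stride (no padding)."""
    return (size - patch) // stride + 1


n_mels, n_frames = 128, 1024            # assumed spectrogram shape
freq_patches = num_patches(n_mels)      # patches along the frequency axis
time_patches = num_patches(n_frames)    # patches along the time axis
print(freq_patches, time_patches, freq_patches * time_patches)
```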

Note: shifting the contents of an image up or down doesn't change its meaning, but shifting a spectrogram up or down changes the frequencies and therefore the sound.
CTC (Connectionist Temporal Classification) Loss
loss function used in sequence-to-sequence tasks where input and output have different lengths and the alignment between them is unknown. widely used in ASR and handwriting recognition. instead of forcing the model to predict one exact alignment, CTC sums the probability over all valid alignments of the target.
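"summing over all valid alignments" can be shown by brute force on a tiny case. the sketch below enumerates every frame-level path, keeps the ones that collapse to the target, and adds up their probabilities. this is only an illustration: real implementations use dynamic programming instead of enumeration, and the names and uniform toy probabilities are assumptions.

```python
import itertools
import math

BLANK = "_"


def collapse(path):
    """Merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_probability(frame_probs, target):
    """frame_probs: one dict {symbol: probability} per frame."""
    symbols = list(frame_probs[0])
    total = 0.0
    for path in itertools.product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= frame_probs[t][sym]
            total += p
    return total


uniform = {BLANK: 1 / 3, "a": 1 / 3, "b": 1 / 3}  # toy per-frame outputs
prob = ctc_probability([uniform] * 3, "ab")       # 3 frames, target "ab"
loss = -math.log(prob)
print(prob, loss)
```

with three frames there are five paths that collapse to "ab" (aab, abb, _ab, a_b, ab_), so the probability is 5/27.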

CTC Architectures
the encoder reads the input sequence and maps it into a sequence of hidden states, aka output embeddings. we additionally apply a linear mapping on the sequence of hidden states to get class label predictions. we get one class label prediction for each hidden state. the class labels are the characters of the alphabet.
the problem is alignment: we don't know how the characters in the transcription line up with the audio.
if our input is a one-second audio file, the ASR model first downsamples the audio input using a CNN feature encoder to a shorter sequence of hidden states, where there is one hidden state vector for every 20ms of audio, so we forward 50 hidden states to the transformer encoder. the extracted audio segments partially overlap: even though one hidden state is emitted every 20ms, each hidden state actually represents 25ms of audio. the transformer encoder predicts one feature representation for each of the hidden states. each output has a dimensionality of 768, so the output shape is 768, 50. each of these predictions covers 25ms, which is shorter than the duration of a phoneme.

to make text predictions we map each of the 768-dim encoder outputs to our character labels using a linear CTC head. the model then predicts a 50, 32 tensor containing logits, where 32 is the number of tokens in the vocabulary.
B_R_II_O_N_||_S_AWW_|||||_S_OMEE_TH_ING_||_C_L_O_S_E||TO|_P_A_N_I_C_||_ON||HHI_S||_OP_P_O_N_EN_T_'SS||_F_AA_C_E||_W_H_EN||THE||M_A_NN_||||_F_I_N_AL_LL_Y||||_RREE_C_O_GG_NN_II_Z_ED|||HHISS|||_ER_RRR_ORR||||
if we simply predict one character every 20ms, our output sequence might look like the above. many characters have been duplicated. CTC is a way to filter out these duplicates.
CTC algorithm
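CTC uses a special token called the blank token. ‘_’ is the blank token and represents no output. ‘|’ is the word separator character and tells where the word breaks are. identical characters that repeat without a blank between them are merged into one, then the blanks are removed. look at the last word from the sequence: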
_ER_RRR_ORR
_ER_R_OR
ERROR
if we apply the same procedure to the entire sequence and replace ‘|’ with a space, we would get
BRION SAW SOMETHING CLOSE TO PANIC ON HIS OPPONENT'S FACE WHEN THE MAN FINALLY RECOGNIZED HIS ERROR
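the collapse steps above (merge repeats, drop blanks, turn the word separator into a space) can be sketched in a few lines. the function name is made up; this is a greedy post-processing sketch, not the decoder of any particular library.

```python
def ctc_collapse(frames, blank="_", word_sep="|"):
    """Turn per-frame characters into text.

    1. merge consecutive duplicates,
    2. drop blank tokens,
    3. replace the word separator with a space.
    """
    chars, prev = [], None
    for ch in frames:
        if ch != prev and ch != blank:
            chars.append(ch)
        prev = ch
    return "".join(chars).replace(word_sep, " ").strip()


print(ctc_collapse("_ER_RRR_ORR"))
print(ctc_collapse("HHISS|||_ER_RRR_ORR||||"))
```

note that the blank between the two R's in "_ER_R" is what keeps the double letter in ERROR from being merged.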
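one downside of CTC is that it may output words that sound correct but are not spelled correctly, since the CTC head predicts individual characters rather than whole words.
most encoder-only ASR models with a CTC head work exactly like this.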
Transformer architectures for audio

for audio tasks, input/output sequences can be audio instead of text. in ASR models the input is speech and the output is text. in TTS the input is text and the output is speech. in audio classification the input is audio and the output is a class probability. in voice conversion, the input and output are both audio.
waveform (blue color) is a one-dimensional sequence of floating point numbers. each number represents the sampled amplitude at a given time.
Model inputs
raw waveforms are normalized to zero mean and unit variance.
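a minimal sketch of this normalization step, assuming the waveform is a plain list of floats. the small epsilon guarding against division by zero is an added assumption.

```python
import statistics


def normalize(waveform, eps=1e-7):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = statistics.fmean(waveform)
    std = statistics.pstdev(waveform)
    return [(x - mean) / (std + eps) for x in waveform]


normalized = normalize([0.1, 0.4, -0.2, 0.3])
print(normalized)
```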
after normalizing, the audio sequence is turned into embeddings using a small convolutional neural network, aka the feature encoder, which outputs 512-dimensional vectors.
one downside of using the raw waveform as input is the long sequence length. 30s of audio at 16 kHz gives an input length of 30 x 16,000 = 480,000, which means more computation. if we use a spectrogram, we get the same amount of information but in a more compressed form.

the Whisper model converts the waveform into a log-mel spectrogram. Whisper always splits audio into 30-second segments. each log-mel spectrogram has a shape of 80, 3000, where 80 is the number of mel bins and 3000 is the sequence length. the log-mel spectrogram is processed by a small CNN into a sequence of embeddings.
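the sequence lengths quoted above follow from simple arithmetic. the sketch below assumes a hop of 160 samples (10 ms) between spectrogram frames, which is what makes 30 s of 16 kHz audio come out at 3000 frames; treat that hop value as an assumption.

```python
SAMPLING_RATE = 16_000  # Hz
DURATION_S = 30         # seconds per segment
HOP_LENGTH = 160        # assumed: samples between spectrogram frames (10 ms)
N_MELS = 80             # mel bins

waveform_length = SAMPLING_RATE * DURATION_S        # raw input length
spectrogram_frames = waveform_length // HOP_LENGTH  # time steps after STFT

print(waveform_length)
print((N_MELS, spectrogram_frames))
```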
Model outputs
the transformer outputs a sequence of hidden state vectors. the goal is to transform these vectors into a text or audio output. text is predicted by adding a single linear layer followed by a softmax on top of the transformer's output.
for TTS/generative models that produce audio, we have to add layers that can produce an audio sequence. it's common to generate a spectrogram and then use an additional neural network, aka a vocoder, to turn this spectrogram into a waveform. in the SpeechT5 TTS model, the output from the transformer network is a sequence of 768-dim vectors. a linear layer projects that sequence into a log-mel spectrogram.
a post-net, made of some linear and convolutional layers, refines the spectrogram by reducing noise. the vocoder then makes the final audio waveform.
if we take an existing waveform and apply the STFT, it's possible to perform the inverse operation (ISTFT) to get the original waveform back. audio models that generate a spectrogram as output only predict the amplitude information and not the phase. to turn such a spectrogram into a waveform, we have to estimate the phase information; that's what a vocoder does.
phase: property of a wave that describes its position in time relative to other waves. if we only use amplitude and ignore phase, we get noise. many AI models only predict an amplitude spectrogram; vocoders estimate the missing phase.
Whats idea behind STFT?
the short-time Fourier transform analyses how the frequency content of a signal changes over time. instead of treating the entire signal as one block (like a normal Fourier transform), the STFT slides a window over the signal and computes the Fourier transform in small time segments. it applies the FT at each window position and stacks the results to form a spectrogram. the STFT outputs both magnitude and phase information.
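the sliding-window idea can be sketched with a deliberately naive STFT in pure Python. the frame length, hop size and Hann window are assumptions chosen for the example; a real pipeline would use an FFT-based library routine instead of this slow direct DFT.

```python
import cmath
import math


def stft(signal, frame_len=64, hop=32):
    """Return one complex spectrum (bins 0..frame_len/2) per window position."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]  # Hann window
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        spectra.append(spectrum)
    return spectra


# a sine wave that completes exactly 8 cycles per 64-sample frame
signal = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spectra = stft(signal)

magnitudes = [abs(c) for c in spectra[0]]      # what a spectrogram shows
phases = [cmath.phase(c) for c in spectra[0]]  # what a spectrogram drops
print(len(spectra), magnitudes.index(max(magnitudes)))
```

each complex value carries a magnitude and a phase; stacking the magnitudes over time gives the spectrogram.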
What does a Vocoder do?
when generating audio, a vocoder is used when we don't have the phase information. since most models predict only the amplitude (spectrogram), we need a way to guess the missing phase to reconstruct the waveform.
essential information needed for reconstructing audio: magnitude (amplitude) and phase.
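a classical vocoder such as Griffin-Lim estimates the phase information iteratively and synthesizes the final waveform from the spectrogram using the ISTFT. neural vocoders such as WaveNet and HiFi-GAN learn to generate the waveform directly from the spectrogram.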
if we ignore the phase information and directly apply
Seq2Seq architectures
in encoder-only models, the input waveform was downsampled and there was one prediction for every 20ms of audio. with a seq2seq model, there is no one-to-one correspondence, and input and output sequences can have different lengths. this makes encoder-decoder models suitable for audio tasks.
Automatic speech recognition
the encoder takes a log-mel spectrogram as input and encodes it into a sequence of encoder hidden states. the output of the encoder is passed into the transformer decoder using a mechanism called cross-attention. this is like self-attention but attends over the encoder output. the decoder predicts a sequence of text tokens in an autoregressive manner, a single token at a time. at each timestep the previous output sequence is fed back into the decoder as input, until it predicts the end token. the decoder's attention is causal - the decoder isn't allowed to look into the future.
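the autoregressive loop can be sketched without any model at all. in the sketch below `toy_decoder_step` is a hypothetical stand-in for a real decoder forward pass (it just replays a fixed script), and the token strings are invented; only the control flow is the point.

```python
def greedy_decode(next_token_fn, encoder_states, bos, eos, max_len=20):
    """Feed the tokens produced so far back in until the end token appears."""
    tokens = [bos]
    while len(tokens) < max_len:
        token = next_token_fn(encoder_states, tokens)
        tokens.append(token)
        if token == eos:
            break
    return tokens


def toy_decoder_step(encoder_states, tokens_so_far):
    """Hypothetical decoder: ignores the encoder and replays a script."""
    script = ["hello", "world", "<eos>"]
    return script[len(tokens_so_far) - 1]


decoded = greedy_decode(toy_decoder_step, None, "<bos>", "<eos>")
print(decoded)
```

this picks one token per step (greedy); beam search keeps several candidate sequences alive instead.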

tokens predicted by Whisper can be full words or parts of words. it uses the tokenizer from GPT-2 and has 50k+ unique tokens. a seq2seq model can therefore output a much shorter sequence than a CTC model for the same transcription. the loss function used is cross-entropy loss. at inference this is usually combined with beam search to generate the final sequence.
Text-to-speech
the transformer encoder takes in text tokens and extracts a sequence of hidden states that represent the input text. the transformer decoder applies cross-attention to the encoder output and predicts a spectrogram.
one problem with TTS is that there are many possible speech sounds the input text can be mapped to. different speakers may choose to emphasize different parts of the sentence, which makes TTS models hard to evaluate. because of this, the L1 or MSE loss value isn't actually very meaningful. this is why TTS models are evaluated by human listeners using the mean opinion score (MOS) metric.
an encoder-decoder model is also slower, as the decoding process happens one step at a time. the longer the sequence, the slower the prediction.
Long form transcriptions
Whisper is trained on 680,000 hours of labelled data. Whisper is inherently designed to work with 30-second samples. anything shorter than 30s is padded to 30s with silence. anything longer than 30s is truncated to 30s. memory in a transformer scales with sequence length squared, so passing super long audio files leads to an out-of-memory (OOM) error. we process long-form transcriptions by chunking the input audio into smaller, manageable segments. each segment overlaps with the previous one. we stitch the segments back together at the boundaries. stitching is done after we have transcribed all the chunks. it doesn't matter in which order we transcribe the chunks, because the process is stateless. chunks are measured in seconds.
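a small sketch of the chunking step, computing (start, end) boundaries in seconds. the 5 s overlap and the function name are assumptions for illustration; stitching the transcripts at the boundaries is not shown.

```python
def chunk_boundaries(total_s, chunk_s=30.0, overlap_s=5.0):
    """Split [0, total_s] into chunks that overlap their predecessor."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk length")
    step = chunk_s - overlap_s
    chunks = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        chunks.append((start, end))
        if end >= total_s:
            break
        start += step
    return chunks


print(chunk_boundaries(70.0))
```

because each chunk is transcribed independently, the list can be processed in any order or in batches.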
SpeechT5
this transformer-based model from Microsoft can handle multiple speech tasks. it has six modality-specific pre-nets and post-nets. SpeechT5 is designed to jointly learn text and speech representations. it can do text-to-speech (T-S), speech-to-text (S-T), speech-to-speech (S-S) and text-to-text (T-T).
speechT5 is first pre-trained using large scale unlabeled speech and text data. during pre-training phase all pre-nets and post-nets are used simultaneously.
pre-net is a small NN that processes the input before it enters the main model
post-net is a small CNN that refines the final output of the model
text encoder pre-net: converts text tokens into a representation that can be fed into the encoder
speech decoder pre-net: takes a mel-spectrogram as input and uses linear layers to compress spectrogram into representation.
speech decoder post-net: predicts a residual to add to output spectrogram and is used to refine the results.

Speaker embeddings
speaker embeddings are a way of representing a speaker's identity in a compact form. these embeddings capture the speaker's voice, accent and intonation, and can be used for speaker verification, speaker diarization and speaker identification. there are two techniques for generating speaker embeddings.
I-Vectors (identity vectors): based on a Gaussian mixture model (GMM). they represent speakers as fixed low-dim vectors derived from statistical properties of a speaker's voice, obtained in an unsupervised manner. I-vectors are not robust against background noise.
X-Vectors: derived using neural nets; they capture frame-level speaker information by incorporating temporal context. first the raw speech is converted into spectral features, then short speech segments are processed independently. this is more robust to noise, can handle variable-length speech and still outputs a fixed-size vector.
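since either technique ends in a fixed-size vector, two utterances can be compared with a simple similarity score. the sketch below uses cosine similarity, a common choice for speaker verification but an assumption here; the threshold and the tiny 2-dim vectors are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.7):
    """Hypothetical verification rule: accept if similarity is high enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold


print(same_speaker([1.0, 0.2], [0.9, 0.25]))  # similar direction
print(same_speaker([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```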
HiFi-GAN
is a SOTA GAN-based vocoder designed for high-fidelity speech synthesis. it's capable of generating high-quality audio waveforms from spectrogram inputs. consists of one generator and two discriminators. the generator is a CNN that takes a mel-spectrogram and learns to produce raw audio waveforms. the two discriminators are: the multi-period discriminator (MPD) and the multi-scale discriminator (MSD).
BARK
eliminates the need for a vocoder during inference; generates raw speech waveforms directly. uses EnCodec, which serves as a codec and compression tool. audio is compressed into a lightweight representation using 8 codebooks. each codebook consists of integer vectors. the first few codebooks capture the coarse content of the audio; later codebooks refine it, improving quality and realism.
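the coarse-then-refine role of successive codebooks can be illustrated with a toy residual quantizer, where each codebook encodes what the previous ones left over. this is a scalar sketch with three invented codebooks; the real codec quantizes vectors and its codebooks are learned.

```python
def residual_quantize(value, codebooks):
    """Quantize `value` stage by stage; each stage encodes the remaining error."""
    codes, errors = [], []
    reconstruction = 0.0
    residual = value
    for codebook in codebooks:
        index = min(range(len(codebook)),
                    key=lambda i: abs(residual - codebook[i]))
        codes.append(index)                 # the integer that gets stored
        reconstruction += codebook[index]
        residual = value - reconstruction   # what later stages must fix
        errors.append(abs(residual))
    return codes, reconstruction, errors


# toy codebooks: coarse first, finer afterwards
toy_codebooks = [[-1.0, 0.0, 1.0], [-0.3, 0.0, 0.3], [-0.1, 0.0, 0.1]]
codes, reconstruction, errors = residual_quantize(0.62, toy_codebooks)
print(codes, reconstruction, errors)
```

only the integer codes need to be stored or predicted; the error shrinks as more codebooks are used.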
bark consists of four models working together to generate speech from text.
Massive Multilingual Speech (MMS)
is a pretrained model covering many languages other than English. it can synthesize speech in over 1,100 languages. for TTS it uses VITS, a speech generation network that converts text to raw speech waveforms. first acoustic features are generated, then the waveform is decoded using transposed convolutional layers. during inference, the text encodings are upsampled and transformed into waveforms. there is no need for a vocoder.
Speech-to-speech translation
S-S translation (STST) is an extension of the traditional MT task: we translate speech from one language into speech in another. we can use an end-to-end model or a pipelined model.

adding more components to a pipeline leads to error propagation, where errors introduced in one system are compounded in the next, and latency is increased. ASR + MT + TTS was used to power many commercial STST products, including Google Translate.
Transcribe a meeting
speaker diarization - task of taking an unlabelled audio input and predicting who spoke when. in doing so we can predict start / end timestamps for each speaker turn
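once we have speaker turns with timestamps, they can be matched against timestamped transcript segments to produce a labelled meeting transcript. the sketch below assigns each segment to the speaker turn it overlaps most; the data layout, speaker labels and function names are assumptions for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_transcript(turns, segments):
    """turns: (start, end, speaker); segments: (start, end, text)."""
    labelled = []
    for seg_start, seg_end, text in segments:
        best = max(turns,
                   key=lambda t: overlap(seg_start, seg_end, t[0], t[1]))
        labelled.append((best[2], text))
    return labelled


turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
segments = [(0.2, 3.5, "hello everyone"), (4.1, 8.0, "hi, shall we start")]
labelled = label_transcript(turns, segments)
for speaker, text in labelled:
    print(speaker, text)
```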
