Wednesday, January 25, 2017

discrete signals - Help calculating / understanding the MFCCs: Mel-Frequency Cepstrum Coefficients


I've been reading bits and pieces online but I just can't piece it all together. I have some background knowledge of signals / DSP stuff which should be enough prerequisites for this. I'm interested in eventually coding this algorithm in Java but I don't understand it fully yet which is why I'm here (it counts as math, right?).


Here's how I think it works along with the gaps in my knowledge.




  1. Start with your audio speech sample, say a .wav file, that you can read into an array. Call this array $x[n]$, where $n$ ranges from $0, 1, \ldots ,N-1$ (so $N$ samples). The values correspond to audio intensity I guess - amplitudes.





  2. Split the audio signal into distinct "frames" of 10 ms or so, over which you assume the speech signal is "stationary". So if your sample rate is 44.1 kHz, a 10 ms frame corresponds to 441 samples, or values of $x[n]$.




  3. Do a Fourier transform (an FFT for computation's sake). Now, is this done on the entire signal or on each separate frame of $x[n]$? I think there's a difference, because in general the Fourier transform looks at all elements of a signal, so $\mathcal F(x[n]) \neq \mathcal F(x_1[n])$ concatenated with $\mathcal F(x_2[n])$ concatenated with $\ldots \mathcal F(x_N[n])$, where the $x_i[n]$ are the smaller frames. Anyway, say we do some FFT and end up with $X[k]$ for the rest of this.




  4. Mapping to the Mel scale, and taking logs. I know how to convert regular frequency values to the Mel scale: for each $k$ of $X[k]$ (the "x-axis", if you'll allow me), you can apply the formula here: http://en.wikipedia.org/wiki/Mel_scale. But what about the "y-values", the amplitudes of $X[k]$? Do they just keep the same values but get shifted to the appropriate spots on the new Mel (x-) axis? I also saw in some paper something about taking the log of the actual values of $X[k]$: if $X[k] = A[k] \cdot B[k]$, where one of those signals is presumed to be unwanted noise, the log turns the multiplicative noise into additive noise, which can hopefully be filtered out (?).





  5. Now the final step is to take a DCT of your modified $X[k]$ from above (however it ended up getting modified). Then you take the amplitudes of this final result and those are your MFCCs. I read something about throwing away high frequency values.




So I'm trying to really iron out how to compute these step by step, and clearly some of the steps above are still eluding me.


Also, I've heard about using "filter banks" (basically an array of band-pass filters) and don't know whether this refers to making the frames from the original signal, or whether the frames are made after the FFT?


Lastly, there's something I saw about MFCCs having 13 coefficients?



Answer



Step by step...


1. & 2. This is correct. Note that the frames are usually overlapping, for example, frame 0 is samples 0 to 440; frame 1 is samples 220 to 660; frame 2 is samples 440 to 880 and so on... Note also that a window function is applied to the samples in the frame.



3. The Fourier transform is done for each frame. The motivation behind this is simple: a speech signal varies over time, but is stationary over short segments, and you want to analyze each short segment individually, because over these segments the signal is simple enough to be described efficiently by a few coefficients. Think of someone saying "hello". You don't want to see all the phonemes collapsed into one single spectrum (the FFT collapses temporal information) by analyzing all the sound at once. You want to see "hhhhheeeeeeeeeeelloooooooooo" so the word can be recognized stage by stage, so it has to be broken down into short segments.
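For illustration, here is how the per-frame magnitude spectrum could be computed. This sketch uses a naive $O(N^2)$ DFT so it stays self-contained; a real implementation would call an FFT library (e.g. JTransforms in Java). The class and method names are made up for this example.

```java
class FrameSpectrum {
    // Naive DFT of one windowed frame; returns |X[k]| for k = 0 .. N/2
    // (the one-sided magnitude spectrum). Use a proper FFT library in practice.
    static double[] magnitudeSpectrum(double[] frame) {
        int n = frame.length;
        double[] mag = new double[n / 2 + 1];
        for (int k = 0; k <= n / 2; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im += frame[t] * Math.sin(angle);
            }
            mag[k] = Math.sqrt(re * re + im * im);
        }
        return mag;
    }
}
```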


4. "Mapping to the Mel scale" is misleading and that's probably why you're getting confused. A better description for this step would be: "Compute the signal energy through a bank of filters tuned to mel-scaled frequencies". Here is how this is done. We consider $N$ frequencies (a commonly used value is $N = 40$) equally spaced according to the mel scale, between 20 Hz (bottom of the hearing range) and the Nyquist frequency. Practical example: The signal is sampled at 8kHz and we want 40 bins. Since 4kHz (Nyquist) is 2250 mel, the filterbank center frequencies will be: 0 mel, 2250 / 39 mel, 2 x 2250 / 39 mel .. 2250 mel.


Once these frequencies have been defined, we compute a weighted sum of the FFT magnitudes (or energies) around each of these frequencies.
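A small sketch of the mel conversion and the center-frequency grid, using the common $2595 \log_{10}(1 + f/700)$ variant of the formula (other variants exist, which is why the worked example above may round differently). The class and method names are just for this illustration, and placing the centers strictly between the edge frequencies is one common convention (the example above includes the endpoints).

```java
class MelScale {
    static double hzToMel(double hz)  { return 2595.0 * Math.log10(1.0 + hz / 700.0); }
    static double melToHz(double mel) { return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0); }

    // Filter center frequencies: evenly spaced on the mel axis between lowHz and highHz.
    static double[] centerFrequenciesHz(int numFilters, double lowHz, double highHz) {
        double lowMel = hzToMel(lowHz), highMel = hzToMel(highHz);
        double[] centersHz = new double[numFilters];
        for (int i = 0; i < numFilters; i++) {
            double mel = lowMel + (highMel - lowMel) * (i + 1) / (numFilters + 1);
            centersHz[i] = melToHz(mel);
        }
        return centersHz;
    }
}
```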


Look at the following picture, representing a filter bank with 12 bins:


[Figure: mel-frequency filterbank with 12 channels]


The 8th bin has a center frequency of around 2 kHz. The energy in the 8th bin is obtained by summing the weighted FFT energies over roughly the 1600 to 2800 Hz range, with the weight peaking at around 2 kHz.


Implementation note: This bunch of weighted sums can be done in a single operation - a matrix multiplication of a "filterbank matrix" by the FFT energies vector.
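The following sketch shows that idea: each row of the filterbank matrix holds the triangular weights of one filter, and all the weighted sums become a single matrix-vector product, followed by the log. It reuses the hypothetical MelScale helpers sketched earlier; the names, the small $10^{-10}$ floor, and the exact triangle construction are illustrative choices, and real implementations differ slightly in these details.

```java
class MelFilterbank {
    // Build a [numFilters x (fftLen/2 + 1)] matrix of triangular weights.
    static double[][] build(int numFilters, int fftLen, double sampleRate,
                            double lowHz, double highHz) {
        // numFilters + 2 boundary points, evenly spaced on the mel axis
        double lowMel = MelScale.hzToMel(lowHz), highMel = MelScale.hzToMel(highHz);
        double[] binOfPoint = new double[numFilters + 2];
        for (int i = 0; i < numFilters + 2; i++) {
            double hz = MelScale.melToHz(lowMel + (highMel - lowMel) * i / (numFilters + 1));
            binOfPoint[i] = hz * fftLen / sampleRate;   // fractional FFT bin index
        }
        double[][] fb = new double[numFilters][fftLen / 2 + 1];
        for (int m = 0; m < numFilters; m++) {
            double left = binOfPoint[m], center = binOfPoint[m + 1], right = binOfPoint[m + 2];
            for (int k = 0; k <= fftLen / 2; k++) {
                if (k > left && k < center)        fb[m][k] = (k - left) / (center - left);
                else if (k >= center && k < right) fb[m][k] = (right - k) / (right - center);
            }
        }
        return fb;
    }

    // One weighted sum per filter (a matrix-vector product), then the log.
    static double[] logMelEnergies(double[][] fb, double[] powerSpectrum) {
        double[] out = new double[fb.length];
        for (int m = 0; m < fb.length; m++) {
            double e = 0.0;
            for (int k = 0; k < powerSpectrum.length; k++) e += fb[m][k] * powerSpectrum[k];
            out[m] = Math.log(e + 1e-10);   // small floor to avoid log(0)
        }
        return out;
    }
}
```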


So at this stage we have "summarized" the FFT spectrum into a set of 40 (12 in the illustration) energy values, each corresponding to a different range of frequencies. We take the log of these values.


5. The next step consists of taking the DCT of this sequence of 40 log-energies. This yields 40 values, and the first $K$ coefficients are the MFCCs (usually $K = 13$). Actually, the very first DCT coefficient is the sum of all the log-energies computed at the previous step, so it is an overall measure of signal loudness and is not very informative as to the actual spectral content of the signal; it is often discarded for speech recognition or speaker identification applications, where the system has to be robust to loudness variations.
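To close the loop, a sketch of that last step: a DCT-II of the log-energies, keeping only the first 13 coefficients. The class and method names and the unnormalized DCT convention are illustrative; libraries often apply an orthonormal scaling, which only changes the coefficients by constant factors.

```java
class Cepstrum {
    // DCT-II of the log mel energies; keep the first numCoeffs values as the MFCCs.
    static double[] mfcc(double[] logEnergies, int numCoeffs) {
        int n = logEnergies.length;                 // e.g. 40 log-energies
        double[] coeffs = new double[numCoeffs];    // e.g. numCoeffs = 13
        for (int k = 0; k < numCoeffs; k++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += logEnergies[i] * Math.cos(Math.PI * k * (i + 0.5) / n);
            }
            coeffs[k] = sum;   // coeffs[0] is the sum of all log-energies (overall loudness)
        }
        return coeffs;
    }
}
```

Chaining the sketches above (frame, spectrum, filterbank, DCT) gives one 13-dimensional MFCC vector per 10 ms frame.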

