Tuesday, July 18, 2017

speech - Using DTW on the MFCCs to match a spoken word to a template



I am trying to code a simple algorithm in MATLAB to detect a single spoken word. The user records the word once to act as a template, then repeats the same word, and the algorithm tries to detect it. Searching this website has answered many of my questions, but a few remain.


So far my algorithm calculates the MFCCs in the following way:



  • Extract the data from a .wav file (recorded at $8000\textrm{ Hz}$)

  • Build frames of $64\textrm{ ms}$ with a $50\%$ overlap, applying a Hann window

  • Calculate the FFT of all these frames

  • Calculate the power spectrum of each frame from the FFT results ($P(k) = \lvert X(k)\rvert^2$)

  • For each frame, extract 26 coefficients by multiplying the power spectrum by each of the 26 Mel filters and summing the result for each filter

  • Calculate the $\log_{10}$ of the coefficients

  • Apply a DCT to each frame's 26 coefficients
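As a rough illustration of the steps above, here is a minimal sketch in plain Python (not the MATLAB code in question). Several details are assumptions, since the post does not specify them: the mel formula $2595\log_{10}(1 + f/700)$, triangular filters spanning $0$ to $f_s/2$, an orthonormal DCT-II, and a small floor added before the logarithm. All function names are made up for the example, and the DFT is a naive $O(N^2)$ loop standing in for an FFT.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)  # assumed mel formula


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def hann_frames(x, frame_len, hop):
    """Cut x into frames of frame_len samples every hop samples, Hann-windowed."""
    win = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (frame_len - 1))
           for n in range(frame_len)]
    return [[x[s + n] * win[n] for n in range(frame_len)]
            for s in range(0, len(x) - frame_len + 1, hop)]


def power_spectrum(frame):
    """P(k) = |X(k)|^2 for k = 0..N/2, using a naive DFT for clarity."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters, equally spaced on the mel scale from 0 to fs/2."""
    mel_max = hz_to_mel(fs / 2.0)
    # Filter edges expressed as (fractional) FFT bin positions.
    edges = [mel_to_hz(i * mel_max / (n_filters + 1)) * n_fft / fs
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(v):
    """Orthonormal DCT-II of a list of numbers."""
    m = len(v)
    out = []
    for i in range(m):
        s = sum(v[j] * math.cos(math.pi * i * (j + 0.5) / m) for j in range(m))
        out.append(s * math.sqrt((1.0 if i == 0 else 2.0) / m))
    return out


def mfcc(x, fs=8000, frame_ms=64, n_filters=26):
    """Return one list of n_filters cepstral coefficients per frame."""
    frame_len = int(round(fs * frame_ms / 1000.0))   # 512 samples at 8 kHz
    bank = mel_filterbank(n_filters, frame_len, fs)
    feats = []
    for frame in hann_frames(x, frame_len, frame_len // 2):  # 50% overlap
        p = power_spectrum(frame)
        # 1e-12 is an assumed floor that keeps log10 finite for silent bands.
        log_e = [math.log10(sum(w * pk for w, pk in zip(row, p)) + 1e-12)
                 for row in bank]
        feats.append(dct_ii(log_e))
    return feats
```

Calling `mfcc(samples)` on a list of floats returns one 26-element list per frame. A real implementation would replace `power_spectrum` with an FFT.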



My goal was to take the MFCCs of the template and the MFCCs of the repeated word and compare them using a DTW algorithm. My DTW algorithm is already programmed and functional.
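For reference, the kind of comparison meant here can be sketched as a textbook DTW over two sequences of MFCC frames. This is a generic Python illustration, not the implementation mentioned above. The Euclidean local cost and the unconstrained step pattern (match, insertion, deletion) are assumptions, and the function names are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(template, query, dist=euclidean):
    """Accumulated cost of the cheapest warping path between two sequences."""
    n, m = len(template), len(query)
    inf = float("inf")
    # d[i][j] = cost of the best path aligning template[:i] with query[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(template[i - 1], query[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # template frame repeated
                                 d[i][j - 1],      # query frame repeated
                                 d[i - 1][j - 1])  # both advance
    return d[n][m]
```

Because the result is a sum along the path, longer words accumulate larger costs. Dividing by a path-length estimate such as `len(template) + len(query)` is one common way to make scores comparable, though that is a design choice and not something the post specifies.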


However, my algorithm does not work very well and it seems like certain parameters affect the results quite a lot. Here are my questions:




  1. Are these steps enough to detect a spoken word?




  2. Would it be better to use a large number of fixed templates instead of having the user pre-record a template? Which method should result in better recognition performance?





  3. If the person repeats the word at a different distance from the microphone, the increased or decreased MFCC values are enough to throw off the DTW results. Is there a smart way to normalize the MFCCs to cancel the effect of microphone distance?




  4. Some websites recommend using only the 13 MFCCs with the smallest values.



    • Why is that?

    • Also, are they talking about the smallest magnitudes, or the smallest values?

    • Assuming 13 MFCCs are large negative numbers while the other 13 MFCCs are small positive numbers, which set would I keep?
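On question 3, it may help to see what a level change does to the features under a simplifying assumption: that microphone distance acts as a pure gain $g$ on the waveform. Real rooms also change reverberation and the signal-to-noise ratio, which this ignores. Under that assumption the power spectrum scales by $g^2$, so every log filter energy shifts by the same constant $2\log_{10} g$, and the DCT of a constant vector is nonzero only in its first coefficient. The Python sketch below checks this on made-up log energies, assuming an orthonormal DCT-II.

```python
import math


def dct_ii(v):
    """Orthonormal DCT-II of a list of numbers."""
    m = len(v)
    out = []
    for i in range(m):
        s = sum(v[j] * math.cos(math.pi * i * (j + 0.5) / m) for j in range(m))
        out.append(s * math.sqrt((1.0 if i == 0 else 2.0) / m))
    return out


def apply_gain_to_log_energies(log_e, gain):
    """A gain g on the waveform adds 2*log10(g) to every log10 filter energy."""
    shift = 2.0 * math.log10(gain)
    return [e + shift for e in log_e]


# Made-up log10 filter energies for one frame (illustrative values only).
log_e = [3.1, 2.7, 2.9, 1.8, 1.2, 0.9, 0.4, 0.1]

near = dct_ii(log_e)
far = dct_ii(apply_gain_to_log_energies(log_e, 0.25))  # quieter recording

# Per-coefficient change caused by the gain: only index 0 is affected,
# up to floating-point rounding.
diffs = [f - n for n, f in zip(near, far)]
```

If this model held exactly, the gain would show up only in the first coefficient of each frame. Any change seen in the higher coefficients would then have to come from effects the model leaves out.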





EDIT: Also, the first coefficient of each frame is always much bigger than the other coefficients. I would say its magnitude is bigger by a factor of 100. When I calculate the DTW using Euclidean distance, this coefficient is effectively the only relevant one, since it is so much bigger than all the other values. Should it be discarded?
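To make the concern concrete, the Python sketch below compares two made-up MFCC frames whose first coefficient is roughly 100 times larger than the rest, and measures how much of the squared Euclidean distance that coefficient accounts for. The numbers are invented for illustration and are not taken from any recording.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up frames: index 0 plays the role of the large first coefficient.
frame_a = [120.0, 1.2, -0.8, 0.5]
frame_b = [100.0, -0.9, 0.7, -0.4]

with_c0 = euclidean(frame_a, frame_b)
without_c0 = euclidean(frame_a[1:], frame_b[1:])  # first coefficient dropped

# Fraction of the squared distance contributed by the first coefficient.
share = (frame_a[0] - frame_b[0]) ** 2 / with_c0 ** 2
```

With these values the first coefficient contributes more than 98% of the squared distance, so the remaining coefficients barely influence the DTW cost unless the first one is dropped or the coefficients are rescaled.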



