In speech recognition, the front end generally performs signal processing to extract features from the audio stream. A discrete Fourier transform (DFT) is applied twice in this process: the first time after windowing; Mel binning is then applied, followed by another Fourier transform.
I've noticed, however, that it is common for speech recognizers (the default front end in CMU Sphinx, for example) to use a discrete cosine transform (DCT) instead of a DFT for the second operation. What is the difference between these two operations? Why would you use a DFT the first time and a DCT the second time?
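For context, here is a minimal sketch of the front-end pipeline described above (window, DFT, Mel binning, then a DCT over the log filterbank energies, as in typical MFCC extraction). The frame size, sample rate, filter count, and number of retained coefficients are illustrative choices, not values from any particular recognizer:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters with centers spaced evenly on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# One Hamming-windowed frame of toy audio (a 440 Hz tone).
sr, n_fft = 16000, 512
t = np.arange(n_fft) / sr
frame = np.sin(2 * np.pi * 440.0 * t) * np.hamming(n_fft)

power = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # first transform: DFT
mel_energies = mel_filterbank(26, n_fft, sr) @ power  # Mel binning
log_mel = np.log(mel_energies + 1e-10)                # compress dynamic range
mfcc = dct(log_mel, type=2, norm='ortho')[:13]        # second transform: DCT-II

print(mfcc.shape)
```

The question, then, is about that last line: why a DCT-II there rather than a second DFT.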