I'm trying to understand how speech recognition works when using HMMs. I know this post is very long, but that's mainly because I'm a complete beginner, and since I have no idea what is clear and what is not, I try to write everything as clearly as possible (which takes a lot of space). Help would be really appreciated.
I will be using O for the observation sequence, X for the state sequence, and M for an HMM model with defined parameters.
My understanding - please correct me where I am wrong.
If I understand it correctly, the 3 basic operations with HMMs are:
1. Evaluation - how likely is the observation O to happen, given an HMM model M, i.e. P(O|M). This is done using the forward algorithm (dynamic programming); see the sketch after this list.
2. Classification - I'm not sure I understand this one - is it finding the state sequence most likely to have caused the observation O, given a model M? If yes, then it is again done with dynamic programming, very similar to the forward algorithm, except you take maximums instead of sums.
3. Training - You change the parameters of M to raise the probability of an observation O happening under your given model. The forward-backward (Baum-Welch) algorithm can be used for this, and it is proven to find a local maximum.
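To check my understanding of points 1 and 2, here is a minimal NumPy sketch of the forward algorithm and its Viterbi counterpart for a discrete-emission HMM (A = transitions, B = emissions, pi = initial distribution; the toy numbers are just illustrative):

```python
import numpy as np

def forward_log_likelihood(A, B, pi, obs):
    """Evaluation: log P(O | M) for a discrete-emission HMM.
    A: (N, N) transitions, B: (N, K) emissions, pi: (N,) initial, obs: symbol indices."""
    alpha = pi * B[:, obs[0]]
    log_likelihood = np.log(alpha.sum())
    alpha /= alpha.sum()                      # rescale to avoid underflow
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]    # sum over previous states
        log_likelihood += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_likelihood

def viterbi(A, B, pi, obs):
    """Classification: most likely state sequence - same recursion, max instead of sum."""
    N, T = A.shape[0], len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)   # score of every i -> j transition
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # backtrack through the best predecessors
        path.append(int(back[t, path[-1]]))
    return path[::-1]

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 0, 1, 1]
print(forward_log_likelihood(A, B, pi, obs))
print(viterbi(A, B, pi, obs))
```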
The actual speech recognition, here in the context of isolated word recognition:
For each word you want to recognize, you have a specific HMM model - here, the example I will use is that if you want to recognize 'one', 'two', 'three', you will have 3 HMM models M_1, M_2, M_3, each with say 3 states and 3 emissions (that might be enough for 3 words?). You make training samples of the specific words and train the models with these sounds. You do this by breaking up each utterance/sound of a certain word into small parts (~10ms) and mapping each part to a certain vector, with the vector being based on some spectral quality of this 10ms part - this is the observation at a certain time.
Now, if you want to recognize a word, you say the word, and you pick the model M_i that maximizes P(O|M_i), for i = 1, ..., L, where L is the number of words (and specific HMM models) that you have trained.
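Here is how I imagine this looks in code - a minimal sketch using the third-party hmmlearn package as one possible HMM implementation; the random arrays only stand in for real feature sequences, and all names are made up:

```python
import numpy as np
from hmmlearn import hmm   # third-party package; one possible HMM implementation

rng = np.random.default_rng(0)

# Stand-in for real feature sequences: one (n_frames, n_features) array per
# training utterance of each word.
def fake_utterances(mean, n_utts=5, n_frames=40, n_feats=13):
    return [mean + rng.normal(size=(n_frames, n_feats)) for _ in range(n_utts)]

training = {"one": fake_utterances(0.0),
            "two": fake_utterances(2.0),
            "three": fake_utterances(-2.0)}

# One small HMM per word, trained only on utterances of that word.
models = {}
for word, utts in training.items():
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    m.fit(np.vstack(utts), lengths=[len(u) for u in utts])
    models[word] = m

# Recognition: score the test utterance under every model and keep the argmax.
test = fake_utterances(2.0, n_utts=1)[0]                       # pretend this is a new "two"
scores = {word: m.score(test) for word, m in models.items()}   # log P(O | M_word)
print(max(scores, key=scores.get), scores)
```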
What I don't get:
What exactly can the vector/observation be? Can it be a relatively simple function, say approximation of the spectral envelope? Or is it usually a vector of R^n?
How is it that if you say "ooone" instead of "one", P("ooone"|M_1) will still be the highest of the three trained HMM models? Is it because, if the model was trained with 'one', the model already favors the observation that happens when the sound of 'o' is said, and the state transition x_1 -> x_1 (where x_1 is likely to emit this observation) is also favored? Basically, is there some convincing proof/intuition that pronouncing a certain word very slowly at certain parts will not cause it to be recognized incorrectly?
In the case of isolated word recognition, does the classification HMM operation happen at all? All I see is training and evaluation (for finding the maximum of P(O|M_i)).
Lastly, could someone please explain how continuous speech recognition differs from this? Apparently, unlike the above, classification happens at some point - I don't really understand how it can be used. I can imagine that maybe for each phoneme a certain HMM model will be used, and then a long sentence will consist of many of these models tied together. How many HMM models will there be, and how many states? How will it be done that a certain state sequence defines the words in the utterance?
Answer
What exactly can the vector/observation be? Can it be a relatively simple function, say approximation of the spectral envelope? Or is it usually a vector of R^n?
You won't go very far by summarizing the entire spectral envelope of a sound with a single real number, so yes, observations are vectors - their dimensionality is typically somewhere in the 10 to 100 component range.
Typically, the features used are MFCCs and their first/second order derivatives. They capture the spectral envelope (in short: dimensionality reduction applied to a low-resolution spectrum computed on a perceptual frequency scale) and have several invariance properties that make them relatively robust.
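As a rough illustration of the feature pipeline, here is a sketch using the third-party librosa package; the file path is a placeholder, and 16 kHz / 10 ms / 13 coefficients are just common choices:

```python
import numpy as np
import librosa   # third-party package, used here only to illustrate the feature pipeline

# "speech.wav" is a placeholder path; 16 kHz is a common sampling rate for speech.
y, sr = librosa.load("speech.wav", sr=16000)

# 13 MFCCs per 10 ms hop, plus first and second order derivatives (deltas),
# giving the classic 39-dimensional observation vector for each frame.
mfcc   = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)  # 160 samples = 10 ms
delta  = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2]).T   # shape: (n_frames, 39)
print(features.shape)
```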
How is it, that if you say "ooone" instead of "one", P("ooone"|M_1) will still be the highest of the three trained HMM models? Is it because if the model was trained with 'one', the model already favors the observation that happens when the sound of 'o' is said, and the state transition of x_1 -> x_1 (where x_1 is likely to emit this observation) is also favored?
Yes, this is what happens.
The cost of stretching the "o" is high, but still lower than having the emission probabilities totally wrong by matching the "o"s or the "n" that follows with other phonemes.
Basically, is there some convincing proof/intuition that pronouncing a certain word very slowly at certain parts will not cause it to be recognized incorrectly?
If the features used for the observation vector are robust, and if the model is trained on the same speaker as for recognition (or if suitable model adaptation measures are used), the emission probabilities get very low when phonemes are confused.
One thing that could cause confusion would be an HMM trained on utterances of "five" and another trained on utterances of "fiiiiiiiiiiile". In this case, "fiiiiiiiiiiiive" might be recognized as "fiiiiiiiiiiile" - the cost of repeatedly stretching the "i" might be enough to counterbalance the confusion between "v" and "l" on a handful of frames. So an abnormal utterance could be confused with another abnormal utterance for which a model is available. The thing is, abnormal utterances are "averaged out" during the training process.
does the classification HMM operation happen at all?
For recognition, it is not relevant to know the most likely sequence of states. However, there are applications for which knowing the state sequence is needed.
In particular, we sometimes want to exactly synchronize the recognized sequence with the original audio recording. This could be useful for subtitling, or for producing a timestamped transcription of a recording. This is also used during Viterbi training, an alternative to the Baum-Welch algorithm that uses state decoding (classification) in the training loop.
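For instance, once you have a decoded per-frame state path (Viterbi output), turning it into timestamps is straightforward; a small sketch, assuming fixed 10 ms frames (the function and the example numbers are made up for illustration):

```python
import numpy as np

def state_segments(states, frame_seconds=0.010):
    """Turn a per-frame state path (e.g. Viterbi output) into
    (state, start_time, end_time) segments, assuming fixed-length frames."""
    states = np.asarray(states)
    changes = np.flatnonzero(np.diff(states)) + 1   # frame indices where the state changes
    starts = np.concatenate(([0], changes))
    ends   = np.concatenate((changes, [len(states)]))
    return [(int(states[s]), round(s * frame_seconds, 3), round(e * frame_seconds, 3))
            for s, e in zip(starts, ends)]

print(state_segments([0, 0, 0, 1, 1, 2, 2, 2, 2]))
# -> [(0, 0.0, 0.03), (1, 0.03, 0.05), (2, 0.05, 0.09)]
```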
Lastly, could someone please explain how continuous speech recognition differs from this?
Left-right models with a handful of states are used to describe diphones or triphones. States have hundreds to thousands of mixture components (or nowadays we eschew GMMs and use other emission models). This is already a ridiculously large model space, so there are tied states - that is to say, some states in the model share the same emission probabilities with other states. Anything that pools parameters together from several parts of the model ensures that each parameter is estimated from more training data.
Then, the triphone models are concatenated to build word models, using pronunciation dictionaries (lexicons). The lexicon maps a word to a sequence of phones. While there are approaches to learn it automatically, this data primarily comes from linguists.
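As a toy illustration of the concatenation step, here is a sketch under assumed conventions: a made-up two-word lexicon, 3 states per phone, a fixed self-loop probability, and only the transition topology (real systems attach tied emission models to each state, which is omitted here):

```python
import numpy as np

# Toy lexicon: each word maps to a sequence of phones (real lexicons come from linguists).
lexicon = {"one": ["W", "AH", "N"], "two": ["T", "UW"]}

STATES_PER_PHONE = 3        # a typical small left-right phone model

def word_transition_matrix(phones, self_loop=0.6):
    """Chain one left-right HMM per phone into a single word-level HMM topology.
    Each state either stays in place (self_loop) or moves on to the next state."""
    n = STATES_PER_PHONE * len(phones)
    A = np.zeros((n + 1, n + 1))    # extra absorbing "exit" state at the end
    for i in range(n):
        A[i, i] = self_loop
        A[i, i + 1] = 1.0 - self_loop
    A[n, n] = 1.0
    return A

A_one = word_transition_matrix(lexicon["one"])
print(A_one.shape)   # (10, 10): 3 phones x 3 states, plus the exit state
```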
Then word models are concatenated together to build a language model. The result is an extremely big FST, which is not even fully composed in memory from its components, but only traversed on the fly during recognition.
Of course, the whole thing is not trained with Baum-Welch; each level is trained separately. Triphone HMMs are trained on aligned speech data, lexicons are curated by linguists, and language models are trained on purely textual data.
Typically, a small amount of aligned speech data (in which the transcriptions are aligned with the audio) is used to bootstrap the triphone models. Then this inexact model can be used to align a larger set of unaligned training data, and the process is iterated.