Thursday, February 2, 2017

modulation - How to mimic/copy/fake someone's voice?


Is there any existing application that can sample someone's voice and use it to modulate another voice, or synthesize text so that it resembles the original speaker?


For example, this Text-to-Speech Demo by AT&T lets you choose a voice and a language from presets that I guess are based on human voices that have been sampled.



What is this process called? Is it voice modulation? Voice synthesis?



Answer



A first note: Most modern text-to-speech systems, like the one from AT&T you have linked to, use concatenative speech synthesis. This technique uses a large database of recordings of one person's voice uttering a long collection of sentences - selected so that the largest number of phoneme combinations are present. Synthesizing a sentence can be done just by stringing together segments from this corpus - the challenging bit is making the stringing together seamless and expressive.
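To make the idea more concrete, here is a minimal, hypothetical sketch in Python of the concatenation step, assuming you already have a dictionary of pre-recorded unit waveforms (for example diphones) stored as NumPy arrays; real unit-selection systems add cost functions for choosing units and much more careful joining.

    import numpy as np

    def concatenate_units(units, crossfade=64):
        """Naively string together pre-recorded waveform units (NumPy arrays),
        with a short linear crossfade at each join to reduce discontinuities."""
        out = units[0].astype(float)
        fade_out = np.linspace(1.0, 0.0, crossfade)
        fade_in = np.linspace(0.0, 1.0, crossfade)
        for unit in units[1:]:
            unit = unit.astype(float)
            # Overlap-add the tail of `out` with the head of the next unit.
            out[-crossfade:] = out[-crossfade:] * fade_out + unit[:crossfade] * fade_in
            out = np.concatenate([out, unit[crossfade:]])
        return out

    # Hypothetical usage: `corpus` maps unit labels (e.g. diphones) to recordings
    # of one speaker; synthesis is then a lookup followed by concatenation.
    # waveform = concatenate_units([corpus[u] for u in ["h-e", "e-l", "l-ou"]])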


There are two big hurdles if you want to use this technique to make President Obama say embarrassing words:



  • You need to have access to a large collection of sentences of the target voice, preferably recorded with uniform recording conditions and good quality. AT&T has a budget to record dozens of hours of the same speaker in the same studio, but if you want to fake someone's voice from just 5 minutes of recordings, it will be difficult.

  • There is a considerable amount of manual alignment and preprocessing before the raw recorded material is in the right "format" to be exploited by a concatenative speech synthesis system.


Your intuition that this is a possible solution is valid - provided you have the budget to tackle these two problems.


Fortunately, there are other techniques which can work with less supervision and less data. The field of speech synthesis interested in "faking" or "mimicking" a voice from a recording is known as voice conversion. Given a recording A1 of target speaker A saying sentence 1, and a recording B2 of source speaker B saying sentence 2, you aim at producing a recording A2 of speaker A saying sentence 2, possibly with access to a recording B1 in which speaker B reproduces with his/her voice the same utterance as the target speaker.



The outline of a voice conversion system is the following:



  1. Audio features are extracted from recording A1, and they are clustered into acoustic classes. At this stage, it is a bit like having bags with all the "a" sounds of speaker A, all the "o" sounds of speaker A, etc. Note that this is a much simpler and rougher operation than true speech recognition - we are not interested in recognizing correctly formed words, and we don't even know which bag contains "o" and which bag contains "a" - we just know that we have multiple instances of the same sound in each bag.

  2. The same process is applied on B2.

  3. The acoustic classes from A1 and B2 are aligned. To continue with the bags analogy, this is equivalent to pairing the bags from step 1 and 2, so that all the sounds we have in this bag from speaker A should correspond to the sounds we have in that bag from speaker B. This matching is much easier to do if B1 is used at step 2.

  4. A mapping function is estimated for each pair of bags. Since we know that this bag contains sounds from speaker A, and that bag contains the same sounds said by speaker B, we can find an operation (for example a matrix multiplication on feature vectors) that makes them correspond. In other words, we now know how to make speaker B's "o" sound like speaker A's "o".

  5. At this stage we have all the cards in hand to perform the voice conversion. For each slice of recording B2, we use the result of step 2 to figure out which acoustic class it corresponds to. We then use the mapping function estimated at step 4 to transform the slice (a rough code sketch of these steps follows the list).
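
For illustration only, here is a rough Python sketch of steps 1-5 operating purely on feature vectors (for example MFCC frames), using scikit-learn's KMeans as the vector quantizer for the acoustic classes, a naive nearest-centroid pairing instead of a proper alignment, and a per-class linear (least-squares) mapping; it converts features rather than audio, and all array names are hypothetical.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_conversion(feats_A, feats_B, n_classes=8):
        """Steps 1-4: cluster each speaker's frames (rows = frames, cols = features),
        pair the classes, and fit one linear mapping (B -> A) per paired class."""
        km_A = KMeans(n_clusters=n_classes, n_init=10).fit(feats_A)   # step 1
        km_B = KMeans(n_clusters=n_classes, n_init=10).fit(feats_B)   # step 2

        # Step 3 (very naive): pair each B centroid with its nearest A centroid.
        # With parallel recordings (A1 vs B1), a real alignment would replace this.
        dists = np.linalg.norm(km_B.cluster_centers_[:, None, :]
                               - km_A.cluster_centers_[None, :, :], axis=-1)
        pairing = dists.argmin(axis=1)

        # Step 4: per class, find a matrix M such that frames_B @ M ~ frames_A.
        maps = {}
        labels_A, labels_B = km_A.labels_, km_B.labels_
        for c in range(n_classes):
            X = feats_B[labels_B == c]
            Y = feats_A[labels_A == pairing[c]]
            # Frames are paired arbitrarily here for brevity; a real system aligns them.
            n = min(len(X), len(Y))
            maps[c], *_ = np.linalg.lstsq(X[:n], Y[:n], rcond=None)
        return km_B, maps

    def convert(feats_B, km_B, maps):
        """Step 5: classify each source frame, then apply that class's mapping."""
        classes = km_B.predict(feats_B)
        return np.vstack([feats_B[i] @ maps[c] for i, c in enumerate(classes)])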


I stress that this operates at a much lower level than performing speech recognition on B2 and then doing TTS using A1's voice as a corpus.


Various statistical techniques are used for steps 1 and 2 - GMM or VQ being the most common ones. Various alignment algorithms are used for step 3 - this is the trickiest part, and it is obviously easier to align A1 vs B1 than A1 vs B2. In the simpler case, methods like Dynamic Time Warping can be used to do the alignment.

As for step 4, the most common transforms are linear transforms (matrix multiplications) on feature vectors. More complex transforms make for more realistic imitations, but the regression problem of finding the optimal mapping is harder to solve. Finally, as for step 5, the quality of the resynthesis is limited by the features used. LPC features are generally easier to deal with when a simple transformation method is used (take a signal frame -> estimate the residual and the LPC spectrum -> pitch-shift the residual if necessary -> apply the modified LPC spectrum to the modified residual). Using a representation of speech that can be inverted back to the time domain, and which provides good separation between prosody and phonemes, is the key here!

Finally, provided you have access to aligned recordings of speakers A and B saying the same sentences, there are statistical models which simultaneously tackle steps 1, 2, 3 and 4 in one single model estimation procedure.
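
For the easier parallel case (aligning A1 against B1, i.e. the same sentence said by both speakers), a bare-bones Dynamic Time Warping over feature frames could look like the sketch below; this is textbook DTW in plain NumPy, not code from any particular voice conversion system, and library implementations are normally preferable.

    import numpy as np

    def dtw_align(feats_A, feats_B):
        """Return a list of (i, j) frame pairs aligning feats_A (rows = frames)
        to feats_B by minimizing the cumulative Euclidean frame distance."""
        nA, nB = len(feats_A), len(feats_B)
        # Pairwise frame distances.
        dist = np.linalg.norm(feats_A[:, None, :] - feats_B[None, :, :], axis=-1)
        # Cumulative cost with the usual (diagonal, up, left) recursion.
        cost = np.full((nA + 1, nB + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, nA + 1):
            for j in range(1, nB + 1):
                cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j - 1],
                                                      cost[i - 1, j],
                                                      cost[i, j - 1])
        # Backtrack the optimal warping path.
        path, i, j = [], nA, nB
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    # The aligned frame pairs then give, for each acoustic class, matched examples
    # of speaker A and speaker B saying the same sound, ready for step 4.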



I might come back with a bibliography later, but a very good place to start to get a feel for the problem and the overall framework used to solve it is Stylianou, Moulines and Cappé's "A system for voice conversion based on probabilistic classification and a harmonic plus noise model".


There is to my knowledge no widely available piece of software performing voice conversion - only software modifying properties of the source voice, like pitch and vocal tract length parameters (for example the IRCAM TRAX transformer), with which you have to experiment in the hope of making your recording sound closer to the target voice.
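
To give a flavor of the kind of source-voice manipulation such tools perform, here is a small sketch of pitch shifting with librosa; the file names and the shift amount are made up, and this is only a crude knob to turn, not actual voice conversion.

    import librosa
    import soundfile as sf

    # Hypothetical input file; n_steps is the pitch shift in semitones,
    # tuned by ear while trying to get closer to the target voice.
    y, sr = librosa.load("source_voice.wav", sr=None)
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=3)
    sf.write("shifted_voice.wav", y_shifted, sr)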

