API to break voice into phonemes / synthesize new speech given speech samples?

Question

You know those movies where the tech geeks record someone's voice, and their software breaks it into phonemes, which they can then use to type in any phrase and make it seem as if the target is saying it?

Does that software exist in an API version? I don't even know what to Google.

+1 just for [zoom…enhance](http://tvtropes.org/pmwiki/pmwiki.php/Main/EnhanceButton)-level absurdity in a legitimate question. — Jon Purdy, Aug 11 '11 at 02:00
There isn't commercially available technology that can mimic someone else's voice to a recognizable degree. There is plenty of text-to-speech synthesis software available, though. Bing text-to-speech — Raj More, Aug 11 '11 at 02:13
@Phonon, Please don't. I am seriously interested in doing something like this. (Only for entertainment purposes - I'm trying to dub a movie...) — AShelly, Aug 11 '11 at 19:07
@AShelly It's been a long time, but can you explain how you did this (breaking voice into phonemes)? I have to do the same, please. — Kundan, Mar 25 '14 at 06:19
I never implemented anything - it's still on my list of 'someday' projects. The 'modeltalker' software linked below looks promising, as does 'eduspeak'. — AShelly, Mar 25 '14 at 13:30

score 14 · Accepted Answer · answered Aug 11 '11 at 02:14

There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.

The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well-understood in theory, but operate poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and further when the timbre needs to be preserved.

These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.

Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.

So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.
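To make the splicing-artifact point concrete, here is a minimal sketch, assuming audio segments are plain lists of float samples. The function name `crossfade_concat` and the linear fade are illustrative choices, not taken from any particular tool. It only smooths the joint between two segments; it does nothing about cadence, intonation, or timbre.

```python
def crossfade_concat(a, b, overlap):
    """Join two sample sequences, blending the last `overlap` samples of `a`
    with the first `overlap` samples of `b` using a linear crossfade."""
    if overlap <= 0:
        # Naive splice: any level jump at the joint is heard as a click.
        return list(a) + list(b)
    if overlap > min(len(a), len(b)):
        raise ValueError("overlap is longer than one of the segments")

    head = list(a[:-overlap])   # part of `a` that is left untouched
    tail = list(b[overlap:])    # part of `b` that is left untouched
    mixed = []
    for i in range(overlap):
        # Weight of `b` rises from just above 0 to just below 1.
        w = (i + 1) / (overlap + 1)
        mixed.append((1.0 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + tail
```

With phoneme-length segments the usable overlap is only a few milliseconds, which is one reason the result still sounds spliced.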

+1 for pointing out the holes in the existing APIs. The movies always make it look like an off-the-shelf solution — stimpy, Aug 11 '11 at 02:22
+1 for the detailed answer. The social engineering aspect is not really important - I'm not actually trying to fool anyone. After reading a bit more, what I mostly want is something like Auto-tune. Are there any open-source implementations of that technology? — AShelly, Aug 11 '19:36
Your best shot at accomplishing this task would be with a neural network, but getting the right data to train it with would be the hard part. — devinbost, Apr 11 '17 at 17:55
@devinbost: Yeah, I think recent advances in style transfer make it quite possible. The original question assumes some kind of corpus of the target’s voice. — Jon Purdy, Apr 11 '17 at 22:40

score 5 · Answer 2 · answered Jan 31 '14 at 18:11

SRI International (the company that created Siri for iOS) has an SDK called EduSpeak, which will take audio input and break it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave a few lines of text for the presenter to read. After reading the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well each phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes compared to others). The presenter could also click on each individual bar to have only that individual phoneme played back using the original audio.

So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.
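As a rough illustration of the kind of output the demo above displayed (a label, a score, and a playable slice of the original audio per phoneme), here is a minimal sketch. It is not the EduSpeak API; the `PhonemeSegment` type, its field names, and `slice_segment` are hypothetical, and the audio is assumed to be a plain list of samples.

```python
from dataclasses import dataclass


@dataclass
class PhonemeSegment:
    label: str      # phoneme symbol, e.g. "AH" (hypothetical labelling scheme)
    start_s: float  # start time in seconds within the recording
    end_s: float    # end time in seconds within the recording
    score: float    # pronunciation score, assumed here to lie in 0..1


def slice_segment(samples, sample_rate, segment):
    """Return only the samples that belong to one phoneme segment."""
    first = int(round(segment.start_s * sample_rate))
    last = int(round(segment.end_s * sample_rate))
    return samples[first:last]
```

Playing back one bar of the chart would then amount to sending `slice_segment(...)` to the audio device.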

score 3 · Answer 3 · answered Aug 11 '11 at 06:27

If your aim is to mimic someone else's voice, then another approach is to convert your own voice (instead of assembling phonemes). It is (surprisingly) called voice conversion, e.g. http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm

answered by Itamar Katz

Thank you, this is actually very close to something I could use for the project I am thinking of. (Unfortunately it still seems to be closer to the academic phase than to practical application) – AShelly Aug 12 '11 at 23:33

score 2 · Answer 4 · answered Sep 14 '17 at 19:10

Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.

answered by Nathan Wailes

I checked the link you put in. It produces literally astonishing output. I have used both Google's and Amazon's offerings, but neither could compete with it if the product provided more language options. – Fredric Cliver Apr 17 '21 at 13:11

score 2 · Answer 5 · answered Aug 11 '11 at 02:08

The technology is called "voice synthesis" and "voice recognition".

The Java API for this can be found here: Java voice JSAPI

Apple has an API for this: Apple speech

Microsoft has several; one is discussed here: Vista speech

answered Aug 11 '11 at 02:08 by stimpy · edited Aug 11 '11 at 02:23

The challenge here isn't really in voice synthesis nor voice recognition, but voice transformation. – Speedy Aug 17 '11 at 16:40

score 1 · Answer 6 · answered Aug 06 '12 at 23:46

I dunno about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.

http://www.modeltalker.com

answered by PacoBell

score 1 · Answer 7 · answered Aug 13 '11 at 20:41

You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.

These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonances of the vocal tract (the formants), reversing their effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal pulses. This allows pitch and formants to be scaled independently, which leads to a much better gender conversion result than simple pitch shifting.
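A minimal sketch of the LPC analysis step described above, assuming the textbook autocorrelation method solved with the Levinson-Durbin recursion. This is not how Audition or the Antares plugins are implemented (their internals are not described here), and all function names are illustrative. The toy signal is an impulse train, standing in for glottal pulses, fed through a two-pole resonator, standing in for a single formant.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n + k], for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (coeffs, err) where coeffs[k-1] is a_k in the predictor
    x[n] ~ sum_k a_k * x[n-k], and err is the final prediction error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def lpc_residual(x, coeffs):
    """Inverse filter: e[n] = x[n] - sum_k a_k * x[n-k]."""
    p = len(coeffs)
    residual = []
    for n in range(len(x)):
        predicted = sum(coeffs[k] * x[n - 1 - k] for k in range(min(p, n)))
        residual.append(x[n] - predicted)
    return residual


def toy_voiced_signal(length=800, period=80, a1=1.3, a2=-0.8):
    """Impulse train through a two-pole resonator (a stand-in for one formant)."""
    x = []
    for n in range(length):
        excitation = 1.0 if n % period == 0 else 0.0
        prev1 = x[n - 1] if n >= 1 else 0.0
        prev2 = x[n - 2] if n >= 2 else 0.0
        x.append(a1 * prev1 + a2 * prev2 + excitation)
    return x


signal = toy_voiced_signal()
coeffs, _ = levinson_durbin(autocorrelation(signal, 2), 2)
residual = lpc_residual(signal, coeffs)  # close to the original impulse train
```

On this toy input the estimated coefficients come out close to the resonator's own, and the residual is close to the impulse train. Shifting pitch independently of the formants then means changing the spacing of the residual pulses and running the result back through the unchanged filter; real speech needs short overlapping frames and a higher model order than this sketch uses.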
