18

How do I convert any sound signal to a list of phonemes?

That is, I am looking for the actual methodology and/or code to go from a digital signal to the list of phonemes that the sound recording is made up of.
For example:

lPhonemes = audio_to_phonemes(aSignal)

where for example

from scipy.io.wavfile import read
iSampleRate, aSignal = read(sRecordingDir)

# aSignal is a numpy array holding the recording of the word 'hear'
lPhonemes = ['HH', 'IY1', 'R']

I need the function audio_to_phonemes

Not all sounds are words in a language, so I cannot just use something built on, for example, the Google API.

Edit
I don't want audio to words, I want audio to phonemes. Most libraries do not seem to output that. Any library you recommend needs to be able to output the ordered list of phonemes that the sound is made up of, and it needs to be in Python.

I would also love to know how the process of going from sound to phonemes works. If not for implementation purposes, then for interest's sake.

Marcus Müller
Roman

3 Answers

23

Accurate phoneme recognition is not easy to achieve because phonemes themselves are pretty loosely defined. Even on good audio, the best systems today have a phoneme error rate of about 18% (you can check the LSTM-RNN results on TIMIT published by Alex Graves).
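
To make the error-rate figure concrete, here is a minimal sketch of how a phoneme error rate can be computed, assuming the usual definition: the edit distance (substitutions, insertions, deletions) between the recognized and the reference phoneme sequences, divided by the reference length. The function names and the example sequences are illustrative only and do not come from any particular toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] is the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the length of the reference."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: one substitution (IY -> IH) and one insertion (D)
ref = ['HH', 'IY', 'R']
hyp = ['HH', 'IH', 'R', 'D']
print(phoneme_error_rate(ref, hyp))
```

Note that with this definition the rate can exceed 100% when the recognizer inserts many spurious phonemes.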

In CMUSphinx, phoneme recognition in Python is done like this:

from os import path

from pocketsphinx.pocketsphinx import *
from sphinxbase.sphinxbase import *

MODELDIR = "../../../model"
DATADIR = "../../../test/data"

# Create a decoder with certain model
config = Decoder.default_config()
config.set_string('-hmm', path.join(MODELDIR, 'en-us/en-us'))
config.set_string('-allphone', path.join(MODELDIR, 'en-us/en-us-phone.lm.dmp'))
config.set_float('-lw', 2.0)
config.set_float('-beam', 1e-10)
config.set_float('-pbeam', 1e-10)

# Decode streaming data.
decoder = Decoder(config)

decoder.start_utt()
# The input must be raw PCM audio, read in binary mode.
with open(path.join(DATADIR, 'goforward.raw'), 'rb') as stream:
  while True:
    buf = stream.read(1024)
    if not buf:
      break
    # Arguments: data, no_search=False, full_utt=False
    decoder.process_raw(buf, False, False)
decoder.end_utt()

# In allphone mode each decoded segment is one phoneme.
print ('Best phonemes: ', [seg.word for seg in decoder.seg()])

You need to check out the latest pocketsphinx from GitHub in order to run this example. The result should look like this:

  ('Best phonemes: ', ['SIL', 'G', 'OW', 'F', 'AO', 'R', 'W', 'ER', 'D', 'T', 'AE', 'N', 'NG', 'IY', 'IH', 'ZH', 'ER', 'Z', 'S', 'V', 'SIL'])

See also the wiki page
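
The decoded list above includes 'SIL' markers for silence at both ends. If you only want the phonemes, a small post-processing step can drop them. This is a minimal sketch: `strip_non_speech` is a hypothetical helper, not part of pocketsphinx, and the assumption that 'SIL' is the only non-speech label is based solely on the output shown above.

```python
# Assumption: 'SIL' is the only non-speech label the model emits here;
# extend this set if your model produces other filler labels.
NON_SPEECH = frozenset({'SIL'})


def strip_non_speech(segments, non_speech=NON_SPEECH):
    """Return the segment labels with silence/filler markers removed."""
    return [label for label in segments if label not in non_speech]


decoded = ['SIL', 'G', 'OW', 'F', 'AO', 'R', 'W', 'ER', 'D', 'T', 'AE',
           'N', 'NG', 'IY', 'IH', 'ZH', 'ER', 'Z', 'S', 'V', 'SIL']
phonemes = strip_non_speech(decoded)
print(phonemes)
```

In the example script you would pass `[seg.word for seg in decoder.seg()]` instead of the hard-coded list.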

Nikolay Shmyrev
  • Hi Nikolai. I'm having trouble installing the latest version, and created another question. Would you please take a look?: http://stackoverflow.com/questions/30728041/install-pocketsphinx-for-python2-7 – Roman Jun 09 '15 at 09:32
  • @Nikolay Shmyrev Could you elaborate regarding `MODELDIR`? What goes there? Thanks. – oba2311 Nov 19 '17 at 08:51
  • MODELDIR is the location of the model. It can be anywhere on the system; it depends on where you have put the files. – Nikolay Shmyrev Nov 21 '17 at 11:44
  • Hi @NikolayShmyrev, can I use an audio file like `goforward.m4a` instead of `goforward.raw`? I tried, and the result was an empty list. If a `.raw` file is required, could you please suggest how to convert an audio file to a raw one? – Protocole Jun 26 '19 at 12:51
  • No, you have to convert it. You can convert it to WAV with ffmpeg: `ffmpeg -i goforward.m4a -ar 16000 -ac 1 goforward.wav`, and then use goforward.wav. – Nikolay Shmyrev Jun 26 '19 at 21:24
  • Thanks a lot @NikolayShmyrev. Could you please also suggest how to assess these phonemes against the correct pronunciation? Suppose I want to compare the recognized phonemes `['SIL', 'G', 'OW', 'F', 'AO', 'R', 'W', 'ER', 'D', 'T', 'AE', 'N', 'NG', 'IY', 'IH', 'ZH', 'ER', 'Z', 'S', 'V', 'SIL']` with the correct ones `['G', 'OW', 'F', 'AO', 'R', 'W', 'ER', 'D', 'T', 'EH', 'N', 'M', 'IY', 'T', 'ER', 'Z']` taken from [The CMU Pronouncing Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=go+forward+ten+meters). – Protocole Jun 27 '19 at 05:30
  • https://cmusphinx.github.io/wiki/pocketsphinx_pronunciation_evaluation/ – Nikolay Shmyrev Jun 27 '19 at 18:09
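
Regarding the WAV conversion discussed in the comments: a WAV file starts with a header, so reading it with a plain `open()` would feed the header bytes to the decoder as if they were audio. A minimal sketch using Python's standard `wave` module to get at the PCM data only is shown below. `read_pcm_chunks` is a hypothetical helper, and the expected format (16 kHz, mono, 16-bit) is an assumption inferred from the `-ar 16000 -ac 1` flags in the ffmpeg command, not something stated by the model documentation here.

```python
import wave


def read_pcm_chunks(wav_path, chunk_frames=512, expected_rate=16000):
    """Yield raw PCM byte chunks from a mono, 16-bit WAV file."""
    with wave.open(wav_path, 'rb') as wav:
        # Assumed input format: mono, 16-bit samples, 16 kHz.
        if wav.getnchannels() != 1:
            raise ValueError('expected mono audio')
        if wav.getsampwidth() != 2:
            raise ValueError('expected 16-bit samples')
        if wav.getframerate() != expected_rate:
            raise ValueError('expected a sample rate of %d Hz' % expected_rate)
        while True:
            buf = wav.readframes(chunk_frames)  # header is skipped by wave
            if not buf:
                break
            yield buf


# Possible use with the decoder from the answer above:
# for buf in read_pcm_chunks('goforward.wav'):
#     decoder.process_raw(buf, False, False)
```

The checks raise an error early instead of silently decoding audio in a format the model was not trained on.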
11

I need to create the function audio_to_phonemes

You're basically saying:

I need to re-implement 40 years of speech recognition research

You shouldn't be implementing this yourself (unless you're about to be a professor in the field of speech recognition and have a revolutionary new approach), but should be using one of the many existing frameworks. Have a look at sphinx / pocketsphinx!

Marcus Müller
  • I really don't want to reinvent the wheel, I want to implement existing software. `audio_to_phonemes` is supposed to do that. Though I can't find anything that just takes the sound and hands me phonemes and stops there. I don't want *voice recognition*. I want audio to phonemes. – Roman Jun 08 '15 at 10:28
  • Since that's part of the job: look at existing voice recognition software frameworks. They're not monolithic, but have speech models that give them lists of phonemes, usually (of course depending on the individual framework). – Marcus Müller Jun 08 '15 at 10:59
  • You can definitely extract phonemes rather than words in Kaldi. – lCapp Jun 08 '15 at 11:37
  • @lCapp, do you know how to get them out of Kaldi? I haven't installed and looked through Kaldi yet, but it would help to know more. – Roman Jun 08 '15 at 14:25
9

Have a look at Allosaurus, a universal (~2000 languages) phone recognizer that gives you IPA phonemes. I downloaded the latest model and tried it on a sample wave file in Python 3.

$ python -m allosaurus.bin.download_model -m latest
$ python -m allosaurus.run -i sample.wav
æ l u s ɔ ɹ s
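
To get the `audio_to_phonemes` function from the question, one option is to wrap the command above. This is a minimal sketch, not Allosaurus' own Python API: it simply runs the same CLI through `subprocess` and assumes that standard output contains only the space-separated phones, as in the output shown. The `runner` parameter is a hypothetical hook added here so the function can be tried without the model installed.

```python
import subprocess
import sys


def audio_to_phonemes(wav_path, runner=subprocess.run):
    """Run the Allosaurus CLI on wav_path and return its phones as a list."""
    result = runner(
        [sys.executable, '-m', 'allosaurus.run', '-i', wav_path],
        capture_output=True,   # collect stdout instead of printing it
        text=True,             # decode bytes to str
        encoding='utf-8',      # the output contains IPA characters
        check=True,            # raise CalledProcessError on failure
    )
    # Assumption: stdout holds nothing but the space-separated phones.
    return result.stdout.split()


# Example (requires allosaurus and a downloaded model):
# lPhonemes = audio_to_phonemes('sample.wav')
```

Note that this takes a file path rather than a numpy array; if you start from `aSignal`, you would first have to write it to a WAV file.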
ruoho ruotsi