
As an educational project in machine learning, I was thinking of creating a voice identification system from scratch. It should be able to identify a speaker by voice after previously being trained on recordings of that speaker.

What approach should I take in tackling this challenge? Specifically, how would such a system work at a high level?

Any advice would be appreciated :)

Chetan

2 Answers


To use your machine learning algorithm, you must first define the features you are going to feed it.

The easiest thing to do would be to compute the Fourier Transform of the audio signal (with any FFT tool you want, it's pretty standard), and build a feature vector with the information on frequencies and their amplitude.
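For instance, here's a minimal sketch in Python with NumPy/SciPy (the file name is just a placeholder, and the recording is assumed to be mono):

    import numpy as np
    from scipy.io import wavfile

    # Load a mono recording (hypothetical file name).
    rate, audio = wavfile.read("speaker_sample.wav")

    # Magnitude spectrum of the whole clip.
    spectrum = np.abs(np.fft.rfft(audio))

    # Average the spectrum into a fixed number of bands to get a
    # fixed-length feature vector, regardless of recording length.
    n_bands = 64
    features = np.array([band.mean() for band in np.array_split(spectrum, n_bands)])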

If that's not enough, you could use a spectrogram to add temporal information.
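A rough sketch of that, again assuming SciPy and a mono WAV file:

    import numpy as np
    from scipy import signal
    from scipy.io import wavfile

    rate, audio = wavfile.read("speaker_sample.wav")  # hypothetical file

    # Short-time spectra: rows are frequencies, columns are time frames.
    f, t, Sxx = signal.spectrogram(audio, fs=rate, nperseg=512)
    log_spec = np.log(Sxx + 1e-10)  # log scale; the epsilon avoids log(0)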

Once the features are correctly set, you can start playing with your favorite classifier algorithm!
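For example, with scikit-learn (the data here is random filler, just to show the shape of the problem):

    import numpy as np
    from sklearn.svm import SVC

    # Stand-in data: one 64-dim spectral feature vector per training clip.
    X_train = np.random.rand(20, 64)
    y_train = ["alice"] * 10 + ["bob"] * 10

    clf = SVC().fit(X_train, y_train)
    print(clf.predict(np.random.rand(1, 64)))  # predicted speaker label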

If you use Python, I found this question explaining how to do the FFT part: FFT for Spectrograms in Python

bendaizer
  • Would the spectrogram be a good set of features that can invariantly distinguish the speaker? What aspects of the spectrogram are unique to each speaker? – Chetan Feb 28 '13 at 21:05
  • I'm not really an expert, but it seems that the frequencies are usually sufficient to distinguish between people, because they "bear the trace" of the vocal folds they were generated from. The spectrogram would help to distinguish the modulation of the frequencies over time, thus allowing you to discriminate between a human speaking and another source of noise, like music for instance – bendaizer Mar 01 '13 at 13:32
  • You can see, for instance, how some letters are "mapped" to spectrograms: http://home.cc.umanitoba.ca/~krussll/phonetics/acoustic/spectrogram-sounds.html – bendaizer Mar 01 '13 at 13:36

I made a simple speaker identification system once.

You would want to use features such as Mel-frequency cepstral coefficients (MFCCs), which account for the periodicity in the spectrum due to harmonics and for loudness as sensed by the human ear.
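One easy way to get those in Python is the librosa library (my choice here for illustration, not something this answer prescribes; the file name is a placeholder):

    import librosa

    # Load at the file's native sampling rate (hypothetical file name).
    y, sr = librosa.load("speaker_sample.wav", sr=None)

    # 13 MFCCs per short analysis frame; transpose so each row is one frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    frames = mfcc.T  # shape: (n_frames, 13)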

Then you can cluster the features in the learning phase to get a statistical model. I used VQ (vector quantization) for this, which is quite horrible for this specific use, but I still got usable results. In the identification phase, you then attempt to fit the input data to the different models, which represent different speakers. The better the fit, the lower the error. Be sure to normalize the score against the recording length.
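A sketch of the VQ idea, using scikit-learn's KMeans as the codebook learner (my assumption; any clustering method would do):

    import numpy as np
    from sklearn.cluster import KMeans

    def train_speaker_model(frames, n_codewords=32):
        # Learn a VQ codebook (cluster centroids) from one speaker's MFCC frames.
        return KMeans(n_clusters=n_codewords, n_init=10).fit(frames)

    def quantization_error(model, frames):
        # Mean distance from each frame to its nearest codeword; averaging
        # over frames normalizes the score against recording length.
        return model.transform(frames).min(axis=1).mean()

    def identify(models, frames):
        # models: dict mapping speaker name -> trained codebook.
        # The best-fitting codebook has the lowest quantization error.
        return min(models, key=lambda name: quantization_error(models[name], frames))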

Also, a good way to improve speaker identification is to exclude silence and non-speech sounds.
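One simple way to do that (a plain energy threshold, not necessarily what I used) is to drop low-energy frames:

    import numpy as np

    def drop_silence(audio, rate, frame_ms=25, threshold=0.1):
        # Split the signal into short frames and keep only those whose
        # RMS energy exceeds a fraction of the loudest frame's RMS.
        frame_len = int(rate * frame_ms / 1000)
        n = len(audio) // frame_len
        chunks = audio[:n * frame_len].reshape(n, frame_len).astype(float)
        rms = np.sqrt((chunks ** 2).mean(axis=1))
        return chunks[rms > threshold * rms.max()].reshape(-1)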

hruske