python - audio classification of equal length samples / 'vocoder' thingy

Question

Anybody able to supply links, advice, or other forms of help to the following?

Objective: use Python to classify 10-second audio samples so that I can afterwards speak into a microphone and have Python pick out and play back snippets (faded together) of the closest matches from a database.

I don't need an exact match, and I don't care what the source of the audio samples is. So the result is probably of no use other than speaking in noise (fun).

I would like the Python app to be able to find a specific match, by FFT for example, within the 10-second samples in the database. I guess the real-time sampling of the microphone will use a buffer of about 100 milliseconds.

Any ideas? FFT? What db? Other?

What is meant by "closest match"? Would a man and a woman saying the same word but with very different voice pitch be a close match? Or if one person spoke two different words, each having three syllables, would that be a close match? — TJD, Nov 29 '11 at 22:58

Jason Sundram · Answer 1 · 2011-11-29T21:57:20.730

In order to do this, you need three things:

Segmentation (decide how to make your audio samples)
Feature Extraction (decide what audio feature (e.g. FFT) you care about)
Distance Metric (decide what the "closest" sample is)

Segmentation: you currently describe using 10-second samples. I think you would get better results with shorter segments (closer to 100-1000 ms), which follow the changes in the voice more closely.
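
A minimal sketch of that segmentation step, assuming the audio is already loaded as a 1-D NumPy array. The function name `split_into_segments`, the 8 kHz sample rate, and the noise clip are made up for illustration; they are not from the question.

```python
import numpy as np

def split_into_segments(signal, sample_rate, segment_ms=100, hop_ms=None):
    """Cut a 1-D signal into fixed-length segments (rows of a 2-D array).

    Trailing samples that do not fill a whole segment are dropped.
    If hop_ms is smaller than segment_ms, consecutive segments overlap.
    """
    seg_len = int(sample_rate * segment_ms / 1000)
    hop = seg_len if hop_ms is None else int(sample_rate * hop_ms / 1000)
    if seg_len <= 0 or hop <= 0:
        raise ValueError("segment and hop must be at least one sample long")
    signal = np.asarray(signal, dtype=float)
    starts = range(0, len(signal) - seg_len + 1, hop)
    return np.array([signal[s:s + seg_len] for s in starts])

# Hypothetical input: 10 seconds of noise standing in for one database sample.
sample_rate = 8000
clip = np.random.default_rng(0).standard_normal(10 * sample_rate)

segments = split_into_segments(clip, sample_rate, segment_ms=100)
overlapping = split_into_segments(clip, sample_rate, segment_ms=100, hop_ms=50)
```

Overlapping segments give more candidates to match against, at the cost of a larger database.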

Feature Extraction: you mention using FFT. The zero-crossing rate works surprisingly well considering how simple it is. If you want something fancier, MFCCs or the spectral centroid are probably the way to go.
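
Two of these features are easy to sketch with NumPy alone. The function names, the Hann window, and the test tones below are assumptions for illustration. MFCCs need a mel filterbank and are usually taken from a library, so they are left out here.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of the frame, in Hz."""
    windowed = frame * np.hanning(len(frame))      # reduce spectral leakage
    magnitudes = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = magnitudes.sum()
    return float((freqs * magnitudes).sum() / total) if total > 0 else 0.0

def feature_vector(frame, sample_rate):
    """Stack both features into one vector describing the frame."""
    return np.array([zero_crossing_rate(frame),
                     spectral_centroid(frame, sample_rate)])

# Hypothetical frames: 100 ms of a low tone and of a higher tone.
sample_rate = 8000
t = np.arange(800) / sample_rate
low = np.sin(2 * np.pi * 440 * t)
high = np.sin(2 * np.pi * 1500 * t)

low_features = feature_vector(low, sample_rate)
high_features = feature_vector(high, sample_rate)
```

Both features rise with the frequency content of the frame, which is why even the zero-crossing rate carries useful information. Note that the two values live on very different scales (a ratio versus Hz), so they would normally be rescaled before being compared by distance.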

Distance Metric: most people use the Euclidean distance, but there are alternatives such as the Manhattan distance, cosine distance, and earth mover's distance.
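
The first three metrics can be written in a few lines of plain Python, which also shows how the choice changes what counts as "closest". This is only a sketch; the helper `closest` and the example vectors are invented for illustration, and the earth mover's distance is omitted.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: compares direction and ignores overall scale."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def closest(query, candidates, distance=euclidean):
    """Index of the candidate with the smallest distance to the query."""
    return min(range(len(candidates)),
               key=lambda i: distance(query, candidates[i]))

# The same query gets a different "closest" sample depending on the metric:
query = [1.0, 0.0]
candidates = [[10.0, 0.0],   # same direction as the query, much larger
              [0.9, 0.5]]    # similar size, different direction
by_euclidean = closest(query, candidates, euclidean)
by_cosine = closest(query, candidates, cosine_distance)
```

Here the Euclidean distance picks the second candidate, while the cosine distance picks the first because it only looks at the shape of the vector, not its magnitude.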

For a database, if you have a small enough set of samples, you might try just loading everything into a k-d tree so that you can do fast nearest-neighbor lookups, and just hold it in memory.
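
One possible way to do this is with SciPy's `cKDTree`. The sketch below assumes every database segment has already been reduced to a fixed-length feature vector; the random 13-dimensional features and the helper `nearest_segments` are placeholders for illustration, not real data.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Hypothetical database: 5000 segments, one 13-dimensional feature vector each.
db_features = rng.standard_normal((5000, 13))

# Build the tree once and keep it in memory.
tree = cKDTree(db_features)

def nearest_segments(query_features, k=3):
    """Return (indices, distances) of the k database segments closest to the query.

    The tree uses Euclidean distance by default.
    """
    distances, indices = tree.query(query_features, k=k)
    return np.atleast_1d(indices), np.atleast_1d(distances)

# Features of one incoming microphone buffer (again a random stand-in).
query = rng.standard_normal(13)
indices, distances = nearest_segments(query, k=3)
```

The returned indices point back into whatever list holds the actual audio segments, so the matching snippets can be looked up and played. k-d trees tend to lose their speed advantage as the number of feature dimensions grows, so for long feature vectors a brute-force search may do just as well.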

Good luck! It sounds like a fun project.

Thanks for a lot of new words now buzzing in my head. I will read up on the subjects and post back soon. I'll also post a GitHub URL for the eventual source. – johannesgj, Nov 30 '11 at 20:03

score 0 · Answer 2 · answered Apr 17 '15 at 23:35

You could try some typical short-term feature extraction (e.g. energy, zero-crossing rate, MFCCs, spectral features, chroma, etc.) and then represent each segment by a vector of feature statistics. Then you could use a simple distance-based classifier (e.g. kNN) to retrieve the "closest" training samples from a manually labelled set, given an unknown "query".
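
A rough sketch of that idea in plain NumPy (it does not use the pyAudioAnalysis API). The statistics chosen (mean and standard deviation), the function names, and the synthetic "low"/"high" data are all assumptions for illustration.

```python
from collections import Counter

import numpy as np

def segment_statistics(short_term_features):
    """Collapse a (n_frames, n_features) matrix into one vector:
    the per-feature means followed by the per-feature standard deviations."""
    feats = np.asarray(short_term_features, dtype=float)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

def knn_classify(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k nearest training vectors,
    together with the indices of those neighbours."""
    train = np.asarray(train_vectors, dtype=float)
    distances = np.linalg.norm(train - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0], nearest

# Synthetic stand-in for short-term features: 20 frames x 2 features per segment.
rng = np.random.default_rng(1)

def fake_segment(centre):
    return rng.normal(loc=centre, scale=1.0, size=(20, 2))

train_vectors = [segment_statistics(fake_segment(c)) for c in (0, 0, 0, 5, 5, 5)]
train_labels = ["low", "low", "low", "high", "high", "high"]

query_vector = segment_statistics(fake_segment(5))
label, neighbours = knn_classify(query_vector, train_vectors, train_labels, k=3)
```

For the retrieval use case in the question, the `neighbours` indices matter more than the voted label, since they identify which stored segments to play back.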

Check out my lib on several Python Audio Analysis functionalities: pyAudioAnalysis

score 0 · Answer 3 · answered Nov 29 '11 at 16:12

Try searching for algorithms on "music fingerprinting".

hotpaw2

There's much that could be said about this, but it seems you have some idea of what you want to do. I would suggest just trying it out. You don't even need to start with the FFT; you could begin by matching the raw signals with Euclidean distance (perhaps normalized by volume). Unlikely to work great, but it's a starting point. – dimatura Nov 29 '11 at 17:49
Matching on the raw signal would create a simpler but even crazier representation of noise speech. I like it :-) – johannesgj Nov 30 '11 at 20:04
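
A minimal sketch of the raw-signal matching suggested in the comment above, assuming "normalized by volume" means scaling each segment to unit RMS. The function names and the sine-wave database are made up for illustration.

```python
import numpy as np

def normalise_volume(segment):
    """Scale a segment to unit RMS so loudness does not dominate the distance."""
    segment = np.asarray(segment, dtype=float)
    rms = np.sqrt(np.mean(segment ** 2))
    return segment / rms if rms > 0 else segment

def closest_raw_segment(query, db_segments):
    """Index of the equal-length database segment whose normalised waveform
    is nearest to the query in Euclidean distance."""
    q = normalise_volume(query)
    distances = [np.linalg.norm(q - normalise_volume(s)) for s in db_segments]
    return int(np.argmin(distances))

# Hypothetical database of three 100 ms tones at 8 kHz.
sample_rate = 8000
t = np.arange(800) / sample_rate
db_segments = [np.sin(2 * np.pi * f * t) for f in (200, 440, 900)]

# A much quieter copy of the second tone should still match it.
quiet_query = 0.1 * db_segments[1]
match = closest_raw_segment(quiet_query, db_segments)
```

Comparing waveforms sample by sample is very sensitive to timing: the same sound shifted by a few samples can look far away, which is one reason this is unlikely to work great beyond a first experiment.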
