
For my final year project I am trying to identify dog bark / bird sounds in real time (by recording sound clips). I am using MFCCs as the audio features. Initially I extracted 12 MFCC values per sound clip using the jAudio library. Now I'm trying to train a machine learning algorithm (I haven't decided on one yet, but it will most probably be an SVM). Each sound clip is around 3 seconds long. I need to clarify some things about this process:

  1. Do I have to train the algorithm using frame-based MFCCs (12 per frame) or overall clip-based MFCCs (12 per sound clip)?

  2. To train the algorithm, do I have to treat the 12 MFCCs as 12 different attributes, or as a single attribute?

These are the overall MFCCs for one clip:

-9.598802712290967 -21.644963856237265 -7.405551798816725 -11.638107212413201 -19.441831623156144 -2.780967392843105 -0.5792847321137902 -13.14237288849559 -4.920408873192934 -2.7111507999281925 -7.336670942457227 2.4687330348335212

Any help with these problems would be really appreciated. I couldn't find good guidance on Google. :)

nayakPan

1 Answer

  1. You should calculate MFCCs per frame. Since your signal varies over time, taking them over the whole clip would not make sense. Worse, you might end up with a dog and a bird having similar representations. I'd experiment with several frame lengths; in general they will be on the order of milliseconds.

  2. All of them should be separate features. Let the machine learning algorithm decide which of them are the best predictors.
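To make point 1 concrete, here is a minimal NumPy/SciPy sketch of frame-based MFCC extraction. It is an illustration, not the jAudio or Yaafe implementation; the 25 ms window, 50% overlap, 26 mel filters, and keeping the first 12 DCT coefficients are all assumptions chosen as common defaults.

```python
import numpy as np
from scipy.fftpack import dct

def frame_signal(y, frame_len, hop):
    """Split a signal into overlapping, Hamming-windowed frames."""
    n = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return y[idx] * np.hamming(frame_len)

def mel_filterbank(sr, n_fft, n_mels=26):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_per_frame(y, sr, n_mfcc=12, win_s=0.025, hop_s=0.0125):
    """Return one 12-dimensional MFCC vector per 25 ms frame (50% overlap)."""
    frame_len, hop = int(win_s * sr), int(hop_s * sr)
    frames = frame_signal(y, frame_len, hop)
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2
    logmel = np.log(power @ mel_filterbank(sr, frame_len).T + 1e-10)
    # Keep the first n_mfcc cepstral coefficients of each frame.
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_mfcc]

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s test tone
X = mfcc_per_frame(y, sr)                          # shape: (n_frames, 12)
```

Each row of `X` is one training example with the 12 coefficients as 12 separate attributes, which is exactly the layout point 2 describes.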

Mind that MFCCs are sensitive to noise, so first check how your samples sound. A far richer selection of audio features is offered by e.g. the Yaafe library, and many of them will serve you better in this case. Which specifically? Here's what I found most useful for classifying bird calls:

  • spectral flatness
  • perceptual spread
  • spectral rolloff
  • spectral decrease
  • spectral shape statistics
  • spectral slope
  • Linear Predictive Coding (LPC)
  • Line Spectral Pairs (LSP)
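As an illustration of two features from that list, here is a hedged NumPy sketch of spectral flatness and spectral rolloff computed per frame from a power spectrum. This is not Yaafe's code; the 0.85 rolloff fraction is a common convention I've assumed, and Yaafe's own parameters may differ.

```python
import numpy as np

def spectral_flatness(power_spec, eps=1e-10):
    # Geometric mean / arithmetic mean of each frame's power spectrum.
    # ~1.0 for noise-like (flat) spectra, near 0 for tonal (peaky) spectra.
    gmean = np.exp(np.mean(np.log(power_spec + eps), axis=1))
    amean = np.mean(power_spec, axis=1) + eps
    return gmean / amean

def spectral_rolloff(power_spec, sr, frac=0.85):
    # Frequency below which `frac` of each frame's spectral energy lies.
    cum = np.cumsum(power_spec, axis=1)
    thresh = frac * cum[:, -1:]
    bins = np.argmax(cum >= thresh, axis=1)
    freqs = np.linspace(0, sr / 2, power_spec.shape[1])
    return freqs[bins]
```

Both take the per-frame power spectra (one row per frame, as produced by an FFT over windowed frames) and return one value per frame, so they stack naturally alongside the MFCC columns in the feature matrix.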

Perhaps you might find it interesting to check out this project, especially the part where I interface with Yaafe.

Back in the day I used SVMs, exactly as you are planning. Today I would definitely go with gradient boosting.
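For completeness, both options can be sketched with scikit-learn; the library choice and the synthetic two-class data below are my assumptions for illustration, since the answer names only the techniques.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-in for real MFCC rows: two classes with shifted means,
# 12 columns playing the role of the 12 per-frame coefficients.
X = np.vstack([rng.normal(0.0, 1.0, (200, 12)),
               rng.normal(1.5, 1.0, (200, 12))])
y = np.array([0] * 200 + [1] * 200)   # clip label repeated per frame
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

scores = {}
for clf in (SVC(kernel="rbf"), GradientBoostingClassifier()):
    clf.fit(Xtr, ytr)
    scores[type(clf).__name__] = clf.score(Xte, yte)
```

In practice the labels repeat the clip's class for every frame of that clip, and clip-level predictions can be obtained by majority vote over the frames.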

Lukasz Tracewski
  • This is really helpful. Also, what is the suggested window size for this case? I was thinking of a 3-second window because some birds have long sounds. And when it comes to training, can you please explain how to create the matrix? I have to create 12 different attributes, since all of them are separate features, right? – nayakPan Feb 08 '16 at 02:54
  • Think of the window size as the shortest interval that holds information, a quantum of sound. In 3 seconds you can say a whole sentence. As explained in my answer: on the order of milliseconds, e.g. 16 ms. The windows should overlap by at least 50%. For examples of how to "create a matrix", I refer you to the piece of code I shared. – Lukasz Tracewski Feb 08 '16 at 21:14
  • Does "holds information" mean the complete sound I need, or a small part of the sound I need to recognise? Let's say a particular bird's unique sound is 2 seconds long. If I use a window size of 1 second, there is no chance of capturing that bird's actual sound, is there? :( – nayakPan Feb 09 '16 at 06:48
  • I'd recommend reading an introduction to DSP first. Long story short, your window size should be at least a few times longer (e.g. 5x) than the period of the sound - the lower the pitch, the longer the window should be. However, the longer the window, the lower your time resolution. – Lukasz Tracewski Feb 09 '16 at 07:22