Understanding the output of mfcc

Question

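from librosa import load
from librosa.feature import mfcc

def extract_mfcc(sound):
    # load() returns the samples and the sampling rate (22050 Hz by default)
    data, frame = load(sound)
    # recent librosa versions require keyword arguments here
    return mfcc(y=data, sr=frame)


# use a different name so the imported mfcc function is not shadowed
mfccs = extract_mfcc("sound.wav")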

I would like to get the MFCC of the following sound.wav file which is 48 seconds long.

I understand that len(data) / frame = length of the audio in seconds.

But when I compute the MFCC as shown above and get its shape, this is the result: (20, 2086)

What do those numbers represent? How can I calculate the time of the audio just by its MFCC?

I'm trying to calculate the average MFCC per ms of audio.

Any help is appreciated! Thank you :)

this might help: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/ — tkhurana96, Sep 08 '18 at 07:16

Lukasz Tracewski · Accepted Answer · 2018-09-08T14:52:16.627

14

That's because mel-frequency cepstral coefficients are computed over a window, i.e. a number of samples. Sound is a wave, and one cannot derive any features from a single sample (a single number), hence the window.

To compute MFCCs, a fast Fourier transform (FFT) is used, and that requires a window length. If you check the librosa documentation for mfcc, you won't find this as an explicit parameter. That's because it's implicit, specifically:

length of the FFT window: 2048
number of samples between successive frames: 512

They are passed as **kwargs and defined here.
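
To make the window and hop concrete, here is a minimal sketch in plain Python (no librosa). The helper name frame_windows is made up for illustration, and it ignores any padding, so it only approximates how the signal is cut into overlapping windows of 2048 samples that start 512 samples apart:

```python
def frame_windows(num_samples, n_fft=2048, hop_length=512):
    """Return (start, end) sample indices of every full window, without padding."""
    windows = []
    start = 0
    # keep sliding the window by hop_length while it still fits in the signal
    while start + n_fft <= num_samples:
        windows.append((start, start + n_fft))
        start += hop_length
    return windows


# one second of audio at the default sampling rate of 22050 Hz
windows = frame_windows(num_samples=22050)
print(len(windows))   # number of full windows
print(windows[:3])    # consecutive windows overlap by n_fft - hop_length samples
```

Each window yields one column of the MFCC matrix, which is why the second dimension of the output grows with the audio length and not with the window size.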

If you now take into account the sampling frequency of your audio and these numbers, you will arrive at the final result you have provided.

Since the default sampling rate for librosa is 22050 Hz, the audio length is 48 s and the hop between frames is 512 samples, here's what follows:

48 * 22050 / 512 ≈ 2067

The number is not exactly 2086, as:

Your audio length isn't exactly 48 seconds
The actual window length is 2048, with 512 hop. That means you will "lose" a few frames at the end.
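
As a rough check of this arithmetic, the sketch below estimates the frame count and the audio duration from the hop length. The function names are hypothetical, and the two formulas assume librosa's STFT framing, where center=True (the default) pads the signal by half a window on each side and center=False does not:

```python
def expected_frames(num_samples, n_fft=2048, hop_length=512, center=True):
    """Estimate the number of MFCC frames for a signal of num_samples samples."""
    if center:
        # assumption: the padded signal gives one frame per hop, plus one
        return 1 + num_samples // hop_length
    # assumption: without padding, only windows that fit entirely are counted
    return 1 + (num_samples - n_fft) // hop_length


def approx_duration_seconds(n_frames, sr=22050, hop_length=512):
    """Approximate audio duration from the frame count (accurate to about one hop)."""
    return n_frames * hop_length / sr


sr = 22050
num_samples = 48 * sr  # exactly 48 seconds of audio

print(expected_frames(num_samples))                # frames with center=True
print(expected_frames(num_samples, center=False))  # frames with center=False
print(round(approx_duration_seconds(2086), 2))     # duration implied by 2086 frames
```

Under these assumptions, 2086 frames correspond to roughly 48.4 seconds of audio, and each frame covers a hop of 512 / 22050 s, about 23 ms. librosa also provides librosa.frames_to_time for converting frame indices to timestamps.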

edited Sep 08 '18 at 14:52

answered Sep 08 '18 at 13:57

Lukasz Tracewski


I am glad you have found it helpful! 20 is the number of coefficients you extract. That's the default. – Lukasz Tracewski Sep 08 '18 at 19:41

Just want to clarify that you actually don't "lose frames" at the beginning and end with the default `center=True`. You gain frames because the frames are padded to fit your window length. If you set `center=False`, then nMFCC * hop_len <= num_samples. But with default `center=True`, then nMFCC * hop_len >= num_samples. – Mike Martin Aug 06 '20 at 00:09
What do the numbers mean exactly? You have 2048 and 20 of what? – Sam Jan 23 '21 at 06:00
@Sam That's 2048 samples and 20 Mel-frequency cepstral coefficients. – Lukasz Tracewski Jan 23 '21 at 06:33
I meant 2067, sorry – Sam Jan 23 '21 at 06:44
@Sam Number of coefficients. – Lukasz Tracewski Jan 23 '21 at 13:37
