MFCC Python: completely different result from librosa vs python_speech_features vs tensorflow.signal

Question

I'm trying to extract MFCC features from audio (a .wav file). I have tried python_speech_features and librosa, but they give completely different results:

import librosa
from python_speech_features import mfcc

audio, sr = librosa.load(file, sr=None)

# librosa: 10 ms hop, 25 ms window
hop_length = int(sr/100)
n_fft = int(sr/40)
features_librosa = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=hop_length, n_fft=n_fft)

# psf: 25 ms window, 10 ms step
features_psf = mfcc(audio, sr, numcep=13, winlen=0.025, winstep=0.01)

Below are the plots:

librosa:

python_speech_features:

Did I pass any parameters incorrectly to these two methods? Why is there such a huge difference?

Update: I have also tried tensorflow.signal implementation, and here's the result:

The plot itself matches closer to the one from librosa, but the scale is closer to python_speech_features. (Note that here I calculated 80 mel bins and took the first 13; if I do the calculation with only 13 bins, the result looks quite different as well). Code below:

stfts = tf.signal.stft(audio, frame_length=n_fft, frame_step=hop_length, fft_length=512)
spectrograms = tf.abs(stfts)

num_spectrogram_bins = stfts.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sr, lower_edge_hertz, upper_edge_hertz)
mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:]))

log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
features_tf = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)[..., :13]
features_tf = np.array(features_tf).T

I think my question is: which output is closer to what MFCC actually looks like?

Interesting. Does passing in the samplerate to the psf version as kwarg, i.e. `samplerate=sr`, make a difference? Or changing the `dct_type` in the librosa version? — Hendrik, Mar 02 '20 at 16:21
@Hendrik psf version already has `sr` as input. In terms of `dct_type`, there's some changes when I set to 3, but still very far from the psf outputs (1 and 2 are almost identical). — TYZ, Mar 02 '20 at 16:25
@Hendrik Also tried tensorflow, it's closer to librosa but the scale is so different. — TYZ, Mar 02 '20 at 16:44
"Does passing in the samplerate to the psf version as kwarg, i.e. `samplerate=sr`, make a difference?" -> I meant, you are passing it as positional argument, but it is a keyword argument. Not sure, if passing it as kwarg argument makes a difference here, but I'd adhere to the (explicit) API. — Hendrik, Mar 02 '20 at 16:50

score 20 · Accepted Answer · answered Mar 02 '20 at 18:16

There are at least two factors at play here that explain why you get different results:

There is no single definition of the mel scale. Librosa implements two: Slaney and HTK. Other packages may use different definitions, leading to different results. That said, the overall picture should be similar. That leads us to the second issue...
python_speech_features by default puts energy in the first (index zero) coefficient (appendEnergy is True by default), meaning that when you ask for e.g. 13 MFCCs, you effectively get 12 + 1.

In other words, you were not comparing 13 librosa vs 13 python_speech_features coefficients, but rather 13 vs 12. The energy can have a different magnitude and therefore produce quite a different picture, because it stretches the colour scale.
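To make the first point concrete, here is a minimal sketch of the two mel-scale conventions mentioned above. The function names are hypothetical; the constants are the commonly cited ones for the HTK formula and for the Slaney-style scale (linear below 1 kHz, logarithmic above), so treat this as an illustration rather than any package's exact code.

```python
import math

def hz_to_mel_htk(f_hz):
    """HTK-style mel: one logarithmic formula over the whole range."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_mel_slaney(f_hz):
    """Slaney-style mel: linear below 1 kHz, logarithmic above."""
    f_sp = 200.0 / 3.0                # Hz per mel in the linear region
    min_log_hz = 1000.0               # where the logarithmic region starts
    min_log_mel = min_log_hz / f_sp   # 15 mel at 1 kHz
    logstep = math.log(6.4) / 27.0    # step size in the logarithmic region
    if f_hz < min_log_hz:
        return f_hz / f_sp
    return min_log_mel + math.log(f_hz / min_log_hz) / logstep

# Print both scales side by side for a few frequencies
for f in (100.0, 500.0, 1000.0, 4000.0, 8000.0):
    print(f"{f:7.0f} Hz  htk={hz_to_mel_htk(f):8.2f}  slaney={hz_to_mel_slaney(f):6.2f}")
```

The absolute mel values differ by construction. What matters for MFCCs is that filter centres spaced evenly on one scale land at different frequencies than on the other, so the filterbank outputs are not identical.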

I will now demonstrate how both modules can produce similar results:

import librosa
import python_speech_features
import matplotlib.pyplot as plt
from scipy.signal.windows import hann
import seaborn as sns

n_mfcc = 13
n_mels = 40
n_fft = 512 
hop_length = 160
fmin = 0
fmax = None
sr = 16000
# Note: newer librosa releases replace example_audio_file() with librosa.ex('vibeace')
y, sr = librosa.load(librosa.util.example_audio_file(), sr=sr, duration=5, offset=30)

mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                    n_mfcc=n_mfcc, n_mels=n_mels,
                                    hop_length=hop_length,
                                    fmin=fmin, fmax=fmax, htk=False)

mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                          numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                          preemph=0.0, ceplifter=0, appendEnergy=False, winfunc=hann)

As you can see, the scale is different, but the overall picture looks really similar. Note that I had to make sure that a number of parameters passed to the two modules are the same.
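If only the shape matters and not the absolute scale, one way to compare two implementations is to standardise each coefficient over time and then correlate the trajectories. This is a minimal sketch with made-up numbers; `standardize` and `correlation` are hypothetical helper names, not functions from either library.

```python
import math

def standardize(values):
    """Shift a sequence to zero mean and scale it to unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)

# Made-up trajectory of one coefficient over six frames
track_a = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
# Same shape with a different scale and offset
track_b = [4.343 * v - 20.0 for v in track_a]

print(correlation(track_a, track_b))
```

With the arrays above you would compare `mfcc_librosa[k]` against `mfcc_speech[:, k]`, since librosa returns coefficients by frames and python_speech_features returns frames by coefficients. The two libraries may also produce different frame counts because of padding, so trim both to the shorter length first.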

Thank you for looking into this! I tried to add all the parameters one by one to see which one is causing the difference and I found that it was `appendEnergy`, after adding that, the plot looks quite similar. In terms of the scale, I guess it won't matter too much if we do a scaling after then. — TYZ, Mar 02 '20 at 18:28
@TYZ True. That being said, considering what you guys are doing, I'd consider going well beyond MFCC. — Lukasz Tracewski, Mar 02 '20 at 23:05
Do you have any suggestions on what are the other options? I'm currently reading on fairseq's wav2vec but would be definitely open to more suggestions. — TYZ, Mar 03 '20 at 14:23
@TYZ Since it's going on a tangent to the original question, I will contact you directly. — Lukasz Tracewski, Mar 03 '20 at 14:54

Novak · Answer 2 · 2020-03-31T19:15:16.467

This is the sort of thing that keeps me up at night. The accepted answer is correct (and extremely useful!) but not complete, because it does not explain the large difference in scale between the two approaches. My answer adds a significant extra detail but still does not achieve exact matches.

What's going on is complicated, and best explained with a lengthy block of code below which compares librosa and python_speech_features to yet another package, torchaudio.

First, note that torchaudio's implementation has an argument, log_mels, whose default (False) mimics the librosa implementation, but which mimics python_speech_features if set to True. In both cases the results are still not exact, but the similarities are obvious.
Second, if you dive into the code of torchaudio's implementation, you will see a note that the default is NOT a "textbook implementation" (torchaudio's words, but I trust them) but is provided for librosa compatibility. The key operation in torchaudio that switches from one to the other is:

    mel_specgram = self.MelSpectrogram(waveform)
    if self.log_mels:
        log_offset = 1e-6
        mel_specgram = torch.log(mel_specgram + log_offset)
    else:
        mel_specgram = self.amplitude_to_DB(mel_specgram)

Third, you'll be wondering quite reasonably if you can force librosa to act correctly. The answer is yes (or at least, "It looks like it"): compute the mel spectrogram directly, take the natural log of it, and pass that, rather than the raw samples, as the input to the librosa mfcc function. See the code below for details.
Finally, have some caution, and if you use this code, do examine what happens when you look at different features. The 0th feature still has severe unexplained offsets, and the higher features tend to drift away from each other. This may be something as simple as different implementations under the hood or slightly different numerical stability constants, or it might be something that can be fixed with fine tuning, like a choice of padding or perhaps a reference in a decibel conversion somewhere. I really don't know.
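One part of the scale difference between the dB and log variants can be worked out by hand. Since 10*log10(x) equals (10/ln 10)*ln(x) and the DCT is linear, cepstra computed from a dB spectrum are about 4.343 times those computed from a natural-log spectrum. The sketch below checks this on made-up mel band powers with a hand-written orthonormal DCT-II. It ignores any clipping, reference level, or log offset that a real implementation may apply, so it explains only part of the mismatch.

```python
import math

def dct2_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

# Made-up mel band powers for a single frame
mel_power = [0.5, 2.0, 8.0, 1.0, 0.25, 4.0]

cep_ln = dct2_ortho([math.log(p) for p in mel_power])          # natural-log route
cep_db = dct2_ortho([10.0 * math.log10(p) for p in mel_power])  # decibel route

factor = 10.0 / math.log(10.0)  # about 4.343
for c_ln, c_db in zip(cep_ln, cep_db):
    print(f"ln: {c_ln:8.4f}   dB: {c_db:8.4f}   ln * factor: {c_ln * factor:8.4f}")
```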

Here is some sample code:

import numpy as np
import librosa
import python_speech_features
import matplotlib.pyplot as plt
from scipy.signal.windows import hann
import torchaudio.transforms
import torch

n_mfcc = 13
n_mels = 40
n_fft = 512 
hop_length = 160
fmin = 0
fmax = None
sr = 16000

melkwargs={"n_fft" : n_fft, "n_mels" : n_mels, "hop_length":hop_length, "f_min" : fmin, "f_max" : fmax}

# Note: newer librosa releases replace example_audio_file() with librosa.ex('vibeace')
y, sr = librosa.load(librosa.util.example_audio_file(), sr=sr, duration=5, offset=30)

# Default librosa with db mel scale 
mfcc_lib_db = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                    n_mfcc=n_mfcc, n_mels=n_mels,
                                    hop_length=hop_length,
                                    fmin=fmin, fmax=fmax, htk=False)

# Nearly identical to above
# mfcc_lib_db = librosa.feature.mfcc(S=librosa.power_to_db(S), n_mfcc=n_mfcc, htk=False)

# Modified librosa with log mel scale (helper)
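# n_fft is passed explicitly so the window matches the other calls (librosa's default is 2048)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=fmin,
                                    fmax=fmax, hop_length=hop_length)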

# Modified librosa with log mel scale
mfcc_lib_log = librosa.feature.mfcc(S=np.log(S+1e-6), n_mfcc=n_mfcc, htk=False)

# Python_speech_features 
mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                          numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                          preemph=0.0, ceplifter=0, appendEnergy=False, winfunc=hann)

# Torchaudio 'textbook' log mel scale 
mfcc_torch_log = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=n_mfcc, 
                                            dct_type=2, norm='ortho', log_mels=True, 
                                            melkwargs=melkwargs)(torch.from_numpy(y))

# Torchaudio 'librosa compatible' default dB mel scale 
mfcc_torch_db = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=n_mfcc, 
                                           dct_type=2, norm='ortho', log_mels=False, 
                                           melkwargs=melkwargs)(torch.from_numpy(y))

feature = 1 # <-------- Play with this!!
plt.subplot(2, 1, 1)

plt.plot(mfcc_lib_log.T[:,feature], 'k')
plt.plot(mfcc_lib_db.T[:,feature], 'b')
plt.plot(mfcc_speech[:,feature], 'r')
plt.plot(mfcc_torch_log.T[:,feature], 'c')
plt.plot(mfcc_torch_db.T[:,feature], 'g')
plt.grid()

plt.subplot(2, 2, 3)
plt.plot(mfcc_lib_log.T[:,feature], 'k')
plt.plot(mfcc_torch_log.T[:,feature], 'c')
plt.plot(mfcc_speech[:,feature], 'r')
plt.grid()

plt.subplot(2, 2, 4)
plt.plot(mfcc_lib_db.T[:,feature], 'b')
plt.plot(mfcc_torch_db.T[:,feature], 'g')
plt.grid()

Quite honestly, none of these implementations are satisfying:

python_speech_features takes the inexplicably bizarre approach of replacing the 0th feature with energy rather than augmenting with it, and has no commonly used delta implementation.
Librosa is non-standard by default with no warning, and lacks an obvious way to augment with energy, but has a highly competent delta function elsewhere in the library.
Torchaudio will emulate either, also has a versatile delta function, but still has no clean, obvious way to get energy.
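For what it's worth, augmenting with energy by hand is not much code. The following is a minimal sketch that computes a time-domain log energy per frame and appends it as an extra column. The helper names, the energy floor, and the no-padding framing are all assumptions of mine and will not exactly match any of the three libraries.

```python
import math

def frame_log_energy(signal, frame_length, hop_length, floor=1e-10):
    """Natural-log energy (sum of squares) of each full frame of `signal`."""
    energies = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = signal[start:start + frame_length]
        energy = sum(s * s for s in frame)
        energies.append(math.log(max(energy, floor)))  # floor avoids log(0)
    return energies

def append_energy(mfcc_frames, log_energy):
    """Append log energy as a new last column, keeping every cepstral coefficient."""
    n = min(len(mfcc_frames), len(log_energy))  # framing conventions may differ
    return [list(mfcc_frames[i]) + [log_energy[i]] for i in range(n)]

# Toy data: 8 samples, frames of 4 with hop 2, and two made-up coefficients per frame
signal = [0.0, 1.0, 0.0, -1.0, 0.5, 0.5, 0.5, 0.5]
mfcc_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
augmented = append_energy(mfcc_frames, frame_log_energy(signal, 4, 2))
print(augmented)
```

The input here is frames by coefficients, so a librosa or torchaudio result would need transposing first.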

score 0 · Answer 3 · answered Dec 15 '22 at 21:15

Regarding the difference to tf.signal, and for anyone still looking for this: I had a similar problem some time ago: Matching librosa's mel filterbanks/mel spectrogram to a tensorflow implementation. The solution was to use a different windowing approach for the spectrogram and librosa's mel matrix as constant tensor. See here and here.
