How to make torchaudio and librosa MFCC calculations equivalent?

Question

I've seen this question concerning the same type of issue between librosa, python_speech_features and tensorflow.signal.

I am trying to make torchaudio and librosa compute MFCC features with the same arguments and underlying methods. This is part of a transition from librosa to torchaudio.

Given:

import numpy as np
import torch

from librosa.feature import mfcc
from torchaudio.transforms import MFCC

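sample_rate = 22050
# One second of a constant signal as a simple test input.
audio = np.ones((sample_rate,), dtype=np.float32)

# librosa: extra keyword arguments such as power are forwarded to the mel spectrogram.
librosa_mfcc = mfcc(y=audio, sr=sample_rate, n_mfcc=20, n_fft=2048, hop_length=512, power=2)

# torchaudio: mel spectrogram settings are passed through melkwargs.
mfcc_module = MFCC(sample_rate=sample_rate, n_mfcc=20, melkwargs={"n_fft": 2048, "hop_length": 512, "power": 2})
torch_mfcc = mfcc_module(torch.tensor(audio))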

The shapes of librosa_mfcc and torch_mfcc are both (20, 44), but the arrays themselves are different. For example, librosa_mfcc[0][0] is -487.6101, while torch_mfcc[0][0] is -302.7711.
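
To put a number on the mismatch rather than comparing single elements, something like the following could be used. This is only a minimal sketch: the helper name `report_difference` and the tolerance are made up for illustration, and the stand-in arrays reuse the two values quoted above instead of real library output.

```python
import numpy as np

def report_difference(a, b, atol=1e-3):
    """Summarize how far apart two same-shaped feature arrays are."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    abs_diff = np.abs(a - b)
    worst = np.unravel_index(abs_diff.argmax(), abs_diff.shape)
    return {
        "allclose": bool(np.allclose(a, b, atol=atol)),
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        # (coefficient, frame) index of the largest disagreement
        "argmax": tuple(int(i) for i in worst),
    }

# Stand-in arrays with the question's shape; only element [0, 0] is filled in.
example_a = np.zeros((20, 44))
example_b = np.zeros((20, 44))
example_a[0, 0] = -487.6101
example_b[0, 0] = -302.7711
summary = report_difference(example_a, example_b)
```

With the real arrays, the call would be `report_difference(librosa_mfcc, torch_mfcc.numpy())`.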

I admit I am lacking a good amount of domain knowledge here, but am working through the librosa and torchaudio documentation and parameters to learn the different routes they take in MFCC calculation as well as the meaning behind each parameter. How do I make torch_mfcc have the same values as librosa_mfcc?

Differences are likely to be at the mel spectrogram level, as that calculation is a key part of MFCC. So compare the parameters for those. In librosa, make sure to check at least fmin, fmax, htk. At the mel spectrogram level it may also be possible to plot the two outputs and reason about the differences. - Jon Nordby, Sep 22 '20 at 19:07
@jonnor I've tried standardizing the `fmin` and `fmax`, along with `htk` as both `true` and `false`. I still see differences in the array. Could the implementation also potentially differ at the mel filter / STFT level? It looks like `torchaudio` doesn't provide the ability to set kwargs for those, whereas `librosa` does. I will also try plotting like you suggested. - Mario Ishac, Sep 23 '20 at 17:15
Yes, I think mel/STFT level differences can happen for sure. After the mel spectrogram is computed, there is "just" a log transformation and then a DCT-II to get the MFCC. So the mel spectrogram is at least a good halfway point for determining where in the chain the errors start. - Jon Nordby, Sep 23 '20 at 19:44
Not sure about MFCC, but I tried to match the output of MelSpectrogram. You can find the configuration and the code [here](https://colab.research.google.com/drive/10Iex_6WlQfEiIzT4oIZ0M_AocA_mv5rf?usp=sharing). Hope this helps. - stonelazy, Jul 15 '22 at 12:26
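
The "log transformation and then DCT-II" step mentioned in the comments can be sketched in plain NumPy to see how a mel-level difference carries through to the MFCCs. This is an illustration under stated assumptions, not either library's implementation: the function names are hypothetical, the mel spectrograms are random stand-ins, and any reference scaling or dynamic-range clipping that a library's dB conversion may apply is left out.

```python
import numpy as np

def power_to_db(power, amin=1e-10):
    """10 * log10 of a power spectrogram, floored at amin to avoid log(0)."""
    return 10.0 * np.log10(np.maximum(power, amin))

def dct2_ortho(x):
    """Orthonormal DCT-II along axis 0 (rows = mel bands, columns = frames)."""
    n = x.shape[0]
    k = np.arange(n)[:, None]   # output coefficient index
    m = np.arange(n)[None, :]   # input band index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n)) * np.sqrt(2.0 / n)
    basis[0] *= np.sqrt(0.5)    # extra scaling that makes coefficient 0 orthonormal
    return basis @ x

def mfcc_from_mel(mel_power, n_mfcc=20):
    """Log-compress a mel power spectrogram, then keep the first DCT-II coefficients."""
    return dct2_ortho(power_to_db(mel_power))[:n_mfcc]

# Random stand-in for a (n_mels, frames) mel power spectrogram.
rng = np.random.default_rng(0)
mel_a = rng.uniform(0.1, 1.0, size=(128, 44))
# Same spectrogram with a uniform gain of 4, e.g. a different overall scaling.
mel_b = 4.0 * mel_a

mfcc_a = mfcc_from_mel(mel_a)
mfcc_b = mfcc_from_mel(mel_b)
# Difference in coefficient 0 versus the largest difference in all other coefficients.
offset_c0 = mfcc_b[0] - mfcc_a[0]
max_other = np.abs(mfcc_b[1:] - mfcc_a[1:]).max()
```

In this sketch a uniform gain on the mel spectrogram becomes a constant offset in dB, and the DCT-II puts that entire offset into coefficient 0. A mismatch that varies per mel band, such as a different filterbank normalization or mel scale, would spread across the higher coefficients instead, which is why comparing the two mel spectrograms directly is a useful first check.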
