I've seen this question concerning the same type of issue between librosa, python_speech_features, and tensorflow.signal.
I am trying to make torchaudio and librosa compute MFCC features with the same arguments and underlying methods. This is part of a transition from librosa to torchaudio.
Given:
import numpy as np
import torch
from librosa.feature import mfcc
from torchaudio.transforms import MFCC
sample_rate = 22050
audio = np.ones((sample_rate,), dtype=np.float32)
librosa_mfcc = mfcc(y=audio, sr=sample_rate, n_mfcc=20, n_fft=2048, hop_length=512, power=2)
mfcc_module = MFCC(sample_rate=sample_rate, n_mfcc=20, melkwargs={"n_fft": 2048, "hop_length": 512, "power": 2})
torch_mfcc = mfcc_module(torch.tensor(audio))
The shapes of librosa_mfcc and torch_mfcc are both (20, 44), but the arrays themselves are different. For example, librosa_mfcc[0][0] is -487.6101, while torch_mfcc[0][0] is -302.7711.
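One effect I can at least reproduce in isolation (a minimal pure-numpy sketch with hypothetical values, not either library's actual code path): if the two pipelines apply differently normalized mel filterbanks before the 10·log10 dB step, a multiplicative factor at the mel stage becomes a constant additive offset after the log, which could shift the resulting coefficients.

```python
import numpy as np

# Hypothetical mel-spectrogram frame (any positive values work).
mel = np.array([1.0, 4.0, 9.0], dtype=np.float64)

# Suppose one library's filterbank is a scaled version of the other's.
scale = 0.5  # hypothetical normalization factor
mel_scaled = scale * mel

# dB conversion as 10 * log10(power).
db = 10.0 * np.log10(mel)
db_scaled = 10.0 * np.log10(mel_scaled)

# The multiplicative scale turns into a constant additive offset in dB:
# 10*log10(scale * x) = 10*log10(scale) + 10*log10(x).
offset = db_scaled - db
print(offset)  # every entry equals 10 * log10(0.5)
```

This alone doesn't account for the -487.6101 vs. -302.7711 gap, but it is why I suspect the mel filterbank normalization and the log/dB convention rather than, say, the DCT step.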
I admit I am lacking a good amount of domain knowledge here, but I am working through the librosa and torchaudio documentation and parameters to learn the different routes they take in MFCC calculation, as well as the meaning behind each parameter. How do I make torch_mfcc have the same values as librosa_mfcc?