
I am working on a project (emotion detection from speech or voice tone). For features I am using MFCCs, which I understand to some extent and know are very important features when it comes to speech.

This is the code I am using with librosa to extract features from my audio files, which I then use in a neural network for training:

import librosa
import numpy as np
dat, sample_rate = librosa.load(audio_path, res_type='kaiser_fast')
mfccs = np.mean(librosa.feature.mfcc(y=dat, sr=sample_rate, n_mfcc=13).T, axis=0)

What I want to know is: how does taking the average of the Mel-frequency coefficients after taking the transpose affect performance? Am I losing valuable information from my audio file? Or should I use the entire MFCC matrix for training and apply some padding technique so that the size of the training feature stays the same across all training audio files, since they are of different lengths?
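To make that concrete, this is roughly what I mean by padding the full MFCC matrix (just a sketch; max_frames is a placeholder value I would still have to choose):

import numpy as np
import librosa

def mfcc_matrix(audio_path, n_mfcc=13, max_frames=300):
    dat, sample_rate = librosa.load(audio_path, res_type='kaiser_fast')
    mfccs = librosa.feature.mfcc(y=dat, sr=sample_rate, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    if mfccs.shape[1] < max_frames:
        # zero-pad along the time axis up to the fixed width
        mfccs = np.pad(mfccs, ((0, 0), (0, max_frames - mfccs.shape[1])), mode='constant')
    else:
        # truncate longer files to the same width
        mfccs = mfccs[:, :max_frames]
    return mfccs  # shape: (n_mfcc, max_frames)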
I also looked at other techniques, e.g. taking the derivatives (deltas) of the MFCCs and stacking them together, but I am still not sure which technique would provide a better feature set and ultimately better classification results.
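For the derivative idea, I understand it as stacking the MFCCs with their first and second derivatives, roughly like this (reusing dat and sample_rate from the code above):

mfccs = librosa.feature.mfcc(y=dat, sr=sample_rate, n_mfcc=13)
delta = librosa.feature.delta(mfccs)            # first derivative
delta2 = librosa.feature.delta(mfccs, order=2)  # second derivative
features = np.vstack([mfccs, delta, delta2])    # shape: (39, n_frames)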
If these two techniques are not that useful, then maybe I should stick with my current approach as shown in the code, i.e. take the average, and perhaps increase the number of Mel-frequency coefficients from 13 to a higher number.

1 Answer


I think averaging is a bad idea in this case, because, yes, you lose valuable temporal information. But in the context of emotion recognition it matters even more that you suppress the valuable parts of the signal by averaging them with the background. It is well known that emotions are subtle phenomena that may appear only for a short period of time, staying hidden the rest of the time.

Since your motivation is to prepare the audio signal for processing with an ML method, I should say that there are plenty of methods to do this properly. Briefly speaking, you process each MFCC frame independently (for example with a DNN) and then somehow represent the entire sequence. See this answer for more details and links: How to classify continuous audio
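As a rough illustration only (I am assuming Keras/TensorFlow here, which you did not specify; the layer sizes and number of classes are arbitrary): keep the frame-wise MFCC matrix instead of averaging it, apply a small dense network to every frame independently, and pool the frame-level representations over time.

import tensorflow as tf

n_mfcc, n_classes = 13, 4  # n_classes is a placeholder for your emotion labels
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, n_mfcc)),    # variable-length sequence of MFCC frames
    tf.keras.layers.Dense(64, activation='relu'),   # applied to every frame independently
    tf.keras.layers.GlobalMaxPooling1D(),           # keep the strongest response over time
    tf.keras.layers.Dense(n_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')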

To bring a static DNN into the dynamic context, the combination of DNNs with hidden Markov models used to be quite popular. The classical paper describing the approach dates back to 2013: https://www.researchgate.net/publication/261500879_Hybrid_Deep_Neural_Network_-_Hidden_Markov_Model_DNN-HMM_based_speech_emotion_recognition
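This is not the DNN-HMM hybrid from the paper, but the much simpler classical variant with one Gaussian HMM per emotion (using hmmlearn, which I assume you can install) shows the sequence-modelling idea: fit one model per class and classify by log-likelihood.

import numpy as np
from hmmlearn import hmm

def train_class_hmm(sequences, n_states=5):
    # sequences: list of (n_frames, n_mfcc) MFCC matrices for one emotion class
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type='diag', n_iter=50)
    model.fit(X, lengths)
    return model

def predict_emotion(class_models, mfcc_sequence):
    # class_models: dict {emotion_label: trained HMM}; pick the best-scoring class
    scores = {label: m.score(mfcc_sequence) for label, m in class_models.items()}
    return max(scores, key=scores.get)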

More recently, newer methods have been developed, for example: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140441.pdf

Given enough data (and skill) for training, you can employ some kind of recurrent neural network, which solves the sequence classification task by design.
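A minimal sketch of that option (again assuming Keras; the sizes are arbitrary), with zero-padded MFCC sequences and masking so that the padded frames do not influence the result:

import tensorflow as tf

n_mfcc, n_classes = 13, 4  # placeholders
rnn_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, n_mfcc)),
    tf.keras.layers.Masking(mask_value=0.0),  # ignore zero-padded frames
    tf.keras.layers.LSTM(64),                 # summarises the whole variable-length sequence
    tf.keras.layers.Dense(n_classes, activation='softmax'),
])
rnn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')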

Dmytro Prylipko
  • Thanks for your valuable answer. So, as you suggested, I would still be calculating the average of each Mel-frequency coefficient, but this time I will make fixed-size frames and calculate MFCCs on segments of the audio. However, I still can't figure out the better approach for turning them into a fixed-size single vector for training, because I will get a different number of segments for different audio samples. I am thinking of joining the audio files together and saving them with a length of 10 seconds each, to create a new dataset where they all have the same length and hence the same number of segments. –  Feb 21 '21 at 14:01
  • I did not suggest averaging the vectors :) Instead, I suggested employing methods suitable for sequence processing (such as HMMs or RNNs), since you do have sequences of MFCCs of variable length. – Dmytro Prylipko Feb 22 '21 at 07:38