I am working on a project (emotion detection from speech / voice tone). For features I am using MFCCs, which I understand to some extent and know are a very important feature when it comes to speech.
This is the code I am using with librosa to extract features from my audio files, which I then feed into a neural network for training:
import librosa
import numpy as np

dat, sample_rate = librosa.load(audio_path, res_type='kaiser_fast')
mfccs = np.mean(librosa.feature.mfcc(y=dat, sr=sample_rate, n_mfcc=13).T, axis=0)
What I want to know is: how does taking the average of the Mel-frequency coefficients (after taking the transpose) affect performance? Am I losing valuable information from my audio files? Or should I use the entire MFCC matrix for training and apply some padding technique so that the size of the training features stays the same across all training audio files, since they are of different lengths?
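For reference, this is roughly what I imagine the padding approach would look like (the function name and the max_frames value are just placeholders I made up; I would pick max_frames based on my longest clip):

import librosa
import numpy as np

def extract_mfcc_padded(audio_path, n_mfcc=13, max_frames=300):
    dat, sample_rate = librosa.load(audio_path, res_type='kaiser_fast')
    # full MFCC matrix, shape (n_mfcc, n_frames) -- n_frames varies per file
    mfccs = librosa.feature.mfcc(y=dat, sr=sample_rate, n_mfcc=n_mfcc)
    if mfccs.shape[1] < max_frames:
        # zero-pad along the time axis so every file ends up the same width
        mfccs = np.pad(mfccs, ((0, 0), (0, max_frames - mfccs.shape[1])), mode='constant')
    else:
        # truncate longer files to the same width
        mfccs = mfccs[:, :max_frames]
    return mfccs  # shape (n_mfcc, max_frames) for every file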
I have also looked at other techniques, e.g. taking the derivatives (deltas) of the MFCCs and joining them together, but I am still not sure which technique would give a better feature set and ultimately better classification results.
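This is roughly how I understand the delta idea, using librosa's delta function and keeping my current averaging step afterwards (just a sketch of what I mean, not something I have settled on):

import librosa
import numpy as np

dat, sample_rate = librosa.load(audio_path, res_type='kaiser_fast')
mfcc = librosa.feature.mfcc(y=dat, sr=sample_rate, n_mfcc=13)
delta = librosa.feature.delta(mfcc)            # first-order derivative over time
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivative
features = np.concatenate([mfcc, delta, delta2], axis=0)  # shape (39, n_frames)
feature_vector = np.mean(features.T, axis=0)   # one 39-dimensional vector per file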
If neither of these two techniques is that useful, then maybe I should stick with my current approach as shown in the code, i.e. take the average, and perhaps increase the number of Mel-frequency coefficients from 13 to a higher value.