I am trying to use a voice emotion detection model from GitHub HERE. Based on their examples, I was able to implement the following code to predict the final emotion of an audio file as a single prediction. It appears to make a sub-prediction for each 0.4s window of the audio file, then average those per-window predictions and take the class with the highest average score as the final output (here is the sample file I used).
How can I change it to print a prediction for every 1s chunk of the audio file (as opposed to a single value for the whole file)?
import numpy as np
import pandas as pd
import librosa
from tqdm import tqdm
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dropout, Dense
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
# Create a configuration class to help if I want to change parameters later
class Config:
    def __init__(self, n_mfcc=26, n_feat=13, n_fft=552, sr=22050, window=0.4, test_shift=0.1):
        self.n_mfcc = n_mfcc
        self.n_feat = n_feat
        self.n_fft = n_fft
        self.sr = sr
        self.window = window
        self.step = int(sr * window)
        self.test_shift = test_shift
        self.shift = int(sr * test_shift)
config = Config()
model = pickle.load(open('cnn_ep25_mfccOnly_moreData.pkl', 'rb'))
# Load the audio file
wav, sr = librosa.load('YAF_chain_angry.wav')
# all_results would collect per-file predictions if this looped over many files; unused for a single file
all_results = []
# Initialize a local results list
local_results = []
# Initialize min and max values for the file for scaling
_min, _max = float('inf'), -float('inf')
# Create a list to hold features for each window
X = []
# Iterate over sliding 0.4s windows of the audio file
for i in range(int((wav.shape[0]/sr - config.window)/config.test_shift)):
    X_sample = wav[i*config.shift: i*config.shift + config.step]  # slice out a 0.4s window
    X_mfccs = librosa.feature.mfcc(X_sample, sr, n_mfcc=config.n_mfcc, n_fft=config.n_fft,
                                   hop_length=config.n_fft)[1:config.n_feat + 1]  # generate MFCCs for the window
    _min = min(np.amin(X_mfccs), _min)
    _max = max(np.amax(X_mfccs), _max)  # track global min and max for scaling
    X.append(X_mfccs)  # add the window's features to X
# Put window data into array, scale, then reshape
X = np.array(X)
X = (X - _min) / (_max - _min)
X = X.reshape(X.shape[0], X.shape[1], X.shape[2], 1)
# Feed data for each window into model for prediction
for i in range(X.shape[0]):
    window = X[i].reshape(1, X.shape[1], X.shape[2], 1)
    local_results.append(model.predict(window))
# Average the per-window predictions into one prediction for the whole file
local_results = (np.sum(np.array(local_results), axis=0)/len(local_results))[0]
local_results = list(local_results)
prediction = np.argmax(local_results)
# Emotion labels, in the order the model outputs them
df_cols = ['neutral', 'happy', 'sad', 'angry', 'fearful', 'disgusted', 'surprised']
print(df_cols)
print(local_results)
print("Prediction: "+ df_cols[prediction])