I'm trying to feed .wav files to a neural network in order to train it to recognize what's being said. I have around 10,000 .wav files and the transcription of each one, but when I try to feed the dataset to the neural network I get this error: ValueError: setting an array element with a sequence.
I'm using Soundfile to get the .wav data without the header and putting it into a list. I've tried other libraries too but the result was the same.
import os
import numpy as np
from tqdm import tqdm
import pandas as pd
import soundfile as sf
path = os.getcwd() + "/stft wav/"
audios = []
total = len(os.listdir(path))
pbar = tqdm(total = total)
for file in os.listdir(path):
    data, sr = sf.read(path + file)
    audios.append(data)
    pbar.update(1)
pbar.close()
Then I read the file with the transcription and create the dataset that's going to be fed to the neural network.
dict = pd.read_csv("dictionary.csv", sep = '\t')
dataset = pd.DataFrame(columns = ['Audio', 'Word'])
dataset.Audio = audios
dataset.Word = dict.Romaji
The dataset now looks like this:
   Audio                                              Word
0  [-2.686136382767934e-11, 1.5804246800144028e-1...  inshou
1  [5.0145061436523974e-09, 1.3923349584388234e-0...  taishou
2  [-2.253151087927563e-08, 2.173326230092698e-08...  genshou
3  [3.0560468644580396e-07, 1.0646554073900916e-0...  kishou
4  [0.0, 2.499070395067804e-12, 1.206467304531999...  chuushouteki
The arrays in the Audio column don't all have the same length, but I already tried padding them with zeros and the error message stays the same.
This is how I padded them, in case you're wondering:
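# Copy the whole DataFrame so that X['Audio'] is a valid column lookup.
X = dataset.copy()
# Compute the target length once instead of on every iteration.
max_len = max(len(a) for a in X['Audio'])
# np.resize would repeat each signal to fill the extra space;
# np.pad appends zeros, which is the padding described above (mono audio assumed).
X['Audio'] = [np.pad(a, (0, max_len - len(a))) for a in tqdm(X['Audio'])]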
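One detail worth checking when padding: np.resize fills the extra space by repeating the input, whereas np.pad appends zeros. The following is a minimal sketch of the difference, assuming one-dimensional (mono) signals; pad_to_longest is a hypothetical helper name, not something from the code above.

```python
import numpy as np

def pad_to_longest(signals):
    """Zero-pad 1-D arrays on the right so they all share the longest length."""
    max_len = max(len(s) for s in signals)
    return [np.pad(s, (0, max_len - len(s))) for s in signals]

short = np.array([1.0, 2.0, 3.0])
long = np.array([4.0, 5.0, 6.0, 7.0, 8.0])

# np.resize repeats the input until the requested length is reached.
repeated = np.resize(short, len(long))

# np.pad (default mode "constant") appends zeros instead.
padded = pad_to_longest([short, long])

print(repeated)
print(padded[0])
```

Padding alone does not change how the column is stored, though: each cell still holds a separate array object, which would be consistent with the error persisting after padding.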
A weird thing I noticed is that when I save this dataset as a CSV file and read it again, the float arrays in the Audio column come back as strings. The only way I found to keep them as arrays is saving the dataset as a pickle file.
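The CSV behaviour can be reproduced on a tiny example. This is a sketch with made-up values and an in-memory buffer instead of a real file; only the column names are taken from the dataset above.

```python
import io

import numpy as np
import pandas as pd

# Two short made-up "audio" arrays stand in for the real recordings.
df = pd.DataFrame({
    "Audio": [np.array([0.5, 1.5]), np.array([2.5, 3.5])],
    "Word": ["inshou", "taishou"],
})

# Round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

before = df["Audio"][0]       # still a NumPy array
after = restored["Audio"][0]  # now the text that was written to the file

print(type(before), type(after))
```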
While we're at it, feel free to suggest other methods to feed the .wav files to the neural network. I'm feeding the raw audio instead of spectrograms because I read here that using spectrograms is not a good idea.
Solution
I was looking into similar problems and found a simple and elegant solution. After the train-test split, when passing the audio column to the neural network, use list(X) instead of just X.
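A plausible explanation for why this helps: a pandas column of arrays is a one-dimensional object array, and converting that directly to a float array raises this ValueError, while a list of equal-length arrays converts cleanly to a regular 2-D array. The sketch below shows this with plain NumPy, assuming the arrays are already padded to the same length; the variable names and values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for the padded Audio column: a 1-D object array
# whose cells are equal-length float arrays.
column = np.empty(3, dtype=object)
for i in range(3):
    column[i] = np.full(4, float(i))

# Converting the object array directly fails.
try:
    column.astype(np.float32)
    error = None
except ValueError as exc:
    error = str(exc)

# Going through a list of arrays yields a regular 2-D float array.
batch = np.stack(list(column)).astype(np.float32)

print(error)
print(batch.shape)
```

Stacking into a single array makes the resulting shape explicit, which can be easier to check than a plain list before passing the data on.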
About the CSV file converting the float arrays to strings: a CSV cell can only hold text, so Pandas writes each array as its string representation and then reads that text back as a plain string. The exponent notation is not the cause, since Pandas parses a single value such as 1e-05 as a float without trouble. As I said previously, saving the dataframe as a pickle file works, but it takes too long to read compared to saving the audio column separately as a .npy file.
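Saving the audio column as a .npy file could look like the sketch below. It assumes the arrays have been padded to a common length so they stack into one 2-D array; the file name audios.npy and the sample values are hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np

# Hypothetical padded batch: two signals of four samples each.
audio = np.stack([np.zeros(4), np.ones(4)]).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "audios.npy")  # hypothetical file name
    np.save(path, audio)     # binary format, keeps dtype and shape
    loaded = np.load(path)   # comes back as a float32 array, not strings

print(loaded.dtype, loaded.shape)
```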