I am training an LSTM network in Python TensorFlow on audio data. My dataset is a collection of wave files, which a function read_wavfile turns into a generator of NumPy arrays. I decided to try training my network on the same dataset 20 times, and wrote the following code.
import itertools

import librosa
import tensorflow as tf

from with_hyperparams import stft
from model import lstm_network

# DATA_PATH (a pathlib.Path) and hparams are defined elsewhere in the project.


def read_wavfile():
    for file in itertools.chain(DATA_PATH.glob("**/*.ogg"),
                                DATA_PATH.glob("**/*.wav")):
        waveform, samplerate = librosa.load(file, sr=hparams.sample_rate)
        if len(waveform.shape) > 1:
            waveform = waveform[:, 1]
        yield waveform


audio_dataset = tf.data.Dataset.from_generator(
    read_wavfile,
    tf.float32,
    tf.TensorShape([None]))
dataset = audio_dataset.padded_batch(5, padded_shapes=[None])

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
dataset_init_op = iterator.make_initializer(dataset)
signals = iterator.get_next()

magnitude_spectrograms = tf.abs(stft(signals))
output, loss = lstm_network(magnitude_spectrograms)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(20):
        print(i)
        sess.run(dataset_init_op)
        while True:
            try:
                l, _ = sess.run((loss, train_op))
                print(l)
            except tf.errors.OutOfRangeError:
                break
The full code, including the sufficiently free data used (Wikipedia sound files with IPA transcriptions), is on GitHub. The non-free data (EMU corpus sound files) makes a significant difference, though I am not sure how to show that to you:
- When running the script on the whole dataset, the output starts in iteration 0 with a loss of about 5000, which decreases over the dataset to about 1000. Then comes the line "1" indicating the second loop, and suddenly the loss is at about 5000 again.
- When swapping the order to DATA_PATH.glob("**/*.wav"), DATA_PATH.glob("**/*.ogg"), the loss starts below 5000 and goes down to about 1000, before jumping up to about 4000 again for the *.ogg samples.
Re-ordering the samples gives me a different result, so it looks like the WAV files are more similar to one another than the OGG files are. My notion is that shuffling should ideally happen at the level of the whole dataset, rather than relying on the files being read in random order. However, that would mean reading a lot of WAV files into memory, which does not sound like a good solution.
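To make clear what I mean by a cheaper alternative: shuffling the file paths themselves costs nothing memory-wise, since no audio is loaded. A minimal sketch (the helper name shuffled_paths is made up for illustration, not from my project):

```python
import random


def shuffled_paths(paths, seed=None):
    # Copy the sequence so the caller's ordering is untouched, then
    # shuffle the copy in place; only path objects are permuted,
    # no audio data is read.
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    return paths


# In read_wavfile, the glob results could then be wrapped like this:
# for file in shuffled_paths(itertools.chain(DATA_PATH.glob("**/*.ogg"),
#                                            DATA_PATH.glob("**/*.wav"))):
```

But this only randomizes the order in which files are opened; it does not mix samples within or across batches, which is why I suspect shuffling at the dataset level is still needed.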
What should my code look like?