10

I am training an LSTM network on audio data in TensorFlow (Python). My dataset is a collection of wave files, which read_wavfile turns into a generator of NumPy arrays. I decided to try training my network on the same dataset 20 times, and wrote code along the following lines.

import itertools

import librosa
import tensorflow as tf

from with_hyperparams import stft
from model import lstm_network

# DATA_PATH and hparams are defined elsewhere in the project
# (see the GitHub link below).


def read_wavfile():
    for file in itertools.chain(DATA_PATH.glob("**/*.ogg"),
                                DATA_PATH.glob("**/*.wav")):
        waveform, samplerate = librosa.load(file, sr=hparams.sample_rate)
        if len(waveform.shape) > 1:
            # librosa returns multi-channel audio as (channels, samples);
            # keep a single channel
            waveform = waveform[0]

        yield waveform


audio_dataset = tf.data.Dataset.from_generator(
    read_wavfile,
    tf.float32,
    tf.TensorShape([None]))

dataset = audio_dataset.padded_batch(5, padded_shapes=[None])

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
dataset_init_op = iterator.make_initializer(dataset)

signals = iterator.get_next()

magnitude_spectrograms = tf.abs(stft(signals))

output, loss = lstm_network(magnitude_spectrograms)

train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(20):
        print(i)
        sess.run(dataset_init_op)

        while True:
            try:
                l, _ = sess.run((loss, train_op))
                print(l)
            except tf.errors.OutOfRangeError:
                break

The full code, including the sufficiently free data used (Wikipedia sound files with IPA transcriptions), is on GitHub.

The non-free data (EMU corpus sound files) does make a significant difference, though I am not sure how to show it to you:

  • When running the script on the whole dataset, iteration 0 starts with a loss of about 5000, which then decreases over the dataset to about 1000. Then comes the line printing 1, indicating the second pass, and suddenly the loss is back at about 5000.
  • When swapping the order to DATA_PATH.glob("**/*.wav"), DATA_PATH.glob("**/*.ogg"), the loss starts below 5000 and goes down to about 1000, before jumping back up to about 4000 for the *.ogg samples.

Re-ordering the samples gives me a different result, so it looks like the WAV files are more similar to each other than the OGG files are. I have a notion that shuffling should ideally happen at the level of the dataset, and not rely on the files being read in random order. However, that would mean reading a lot of wave files into memory, which does not sound like a good solution.
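
One way to do that without loading the audio itself might be to shuffle just the list of file paths, which are cheap to hold in memory. An untested sketch:

import random


def read_wavfile_shuffled():
    # Collecting the paths first costs almost no memory, because no audio
    # is loaded yet; the list is reshuffled on every pass over the dataset.
    files = list(itertools.chain(DATA_PATH.glob("**/*.ogg"),
                                 DATA_PATH.glob("**/*.wav")))
    random.shuffle(files)
    for file in files:
        waveform, samplerate = librosa.load(file, sr=hparams.sample_rate)
        if len(waveform.shape) > 1:
            waveform = waveform[0]
        yield waveform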

What should my code look like?

Anaphory
  • I am not really aware of the conventions for programming `tensorflow`. Feel free to edit my code snippet to make it conform to any such conventions and therefore easier to read for other users. – Anaphory Aug 19 '18 at 19:49
  • 1
    For starters, the global variable initialization is usually done within the scope of the `tf.Session`. Does moving `init_op = tf.global_variables_initializer()` within the `with tf.Session() as sess:` block help? It is hard to debug without any data. – campellcl Aug 22 '18 at 09:23
  • For debugging, there is [TensorBoard](https://www.tensorflow.org/guide/summaries_and_tensorboard) and the [TensorFlow Debugger](https://www.tensorflow.org/guide/debugger). I am just starting to learn TensorBoard myself, but it seems like it has the potential to be quite helpful. – campellcl Aug 22 '18 at 09:28
  • 1
    It sounds like a problem with your dataset. Can you add code/a rough description of your `read_wavfile` function? If you can, using `Dataset`'s shuffle/batch/repeat methods are less error-prone than doing these common things yourself. Check [this answer](https://stackoverflow.com/questions/45828616/streaming-large-training-and-test-files-into-tensorflows-dnnclassifier/45829855#45829855) for details on that, or I might be more use if you post your generator function :). – DomJack Aug 22 '18 at 10:45
  • I have added the read_wavfile code and I'm just trying the other suggestions, both on a reduced dataset and on the original data. – Anaphory Aug 22 '18 at 13:32
  • That is an interesting problem. I noticed something similar when I trained my own recurrent network. After one dataset consumption, the loss increased again to a higher value than it was at the end of the last epoch. However, the effect was never as big as in your scenario. I never found the reason. Just as in your case, there was no order in the input data. – Merlin1896 Aug 23 '18 at 17:02
  • @Anaphory One thing to try: Replace the whole `Dataset` part with a `tf.placeholder` and a feed dict in `sess.run()`. This way we can figure out if the `Dataset` has anything to do with the issue. – Merlin1896 Aug 23 '18 at 20:01
  • Yes, it is a problem with my dataset. I have tried to adapt my question to reflect the situation as it really is, and I hope I have turned it into an answerable question. – Anaphory Aug 27 '18 at 11:47

2 Answers

5

Please try this:

  • Add dataset.shuffle(buffer_size=1000) to the input pipeline.
  • Report the loss once per training epoch instead of after every batch.

As illustrated below:

Update to input pipeline

dataset = audio_dataset.shuffle(buffer_size=1000)
# Shuffle before batching, so that individual examples rather than
# whole padded batches are mixed.
dataset = dataset.padded_batch(5, padded_shapes=[None])
iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
dataset_init_op = iterator.make_initializer(dataset)
signals = iterator.get_next()

Update to Session

with tf.Session() as sess:
    sess.run(init_op)

    for i in range(20):
        print(i)
        sess.run(dataset_init_op)

        epoch_losses = []
        while True:
            try:
                l, _ = sess.run((loss, train_op))
                epoch_losses.append(l)
            except tf.errors.OutOfRangeError:
                break

        # print the mean loss once per epoch; evaluating loss after the
        # iterator is exhausted would only raise another OutOfRangeError
        print(sum(epoch_losses) / len(epoch_losses))

If I had access to a few data samples, I might be able to help more precisely. For now I'm working blind here; in any case, do let me know if this works.

Ekaba Bisong
  • But now only 20 samples are passed through the network. Before, the `for` loop represented the epochs and the `while` loop was responsible for feeding all the data from the dataset to the network in a single epoch. – Merlin1896 Aug 23 '18 at 17:08
  • Thanks, @Merlin1896. Posted that in a haste. I've updated my answer. The big idea here is to visualize the loss after each training epoch and not for each training example. Please try this out and let me know if the results and observations are different. Cheers. – Ekaba Bisong Aug 24 '18 at 09:43
  • @EkabaBisong You are right that my dataset is a problem and that shuffling helps. Adding information to the question unfortunately meant that your answer now looks slightly odd, sorry for that. – Anaphory Aug 27 '18 at 11:50
3

This looks like an issue in architecture. First, you are generating your data on the fly which, despite being a commonly employed technique, is not always the most reasonable choice. This is because:

One of the downsides of Dataset.from_generator() is that shuffling the resulting dataset with a shuffle buffer of size n requires n examples to be loaded. This will either create periodic pauses in your pipeline (large n) or result in potentially poor shuffling (small n).

It might be a good idea to convert your data into numpy arrays and then store them on disk as TFRecord files to use as your dataset, like so:

def array_to_tfrecords(X, y, output_file):
  # Flatten the arrays into float lists and wrap them in an Example proto.
  feature = {
    'X': tf.train.Feature(float_list=tf.train.FloatList(value=X.flatten())),
    'y': tf.train.Feature(float_list=tf.train.FloatList(value=y.flatten()))
  }
  example = tf.train.Example(features=tf.train.Features(feature=feature))
  serialized = example.SerializeToString()

  writer = tf.python_io.TFRecordWriter(output_file)
  writer.write(serialized)
  writer.close()

This will take the Dataset.from_generator component out of the equation. The data can then be read back with:

def read_tfrecords(file_names=("file1.tfrecord", "file2.tfrecord", "file3.tfrecord"),
                   buffer_size=10000,
                   batch_size=100):
  dataset = tf.data.TFRecordDataset(file_names)
  dataset = dataset.map(parse_proto)  # parse_proto is sketched below
  dataset = dataset.shuffle(buffer_size)
  dataset = dataset.repeat()
  dataset = dataset.batch(batch_size)
  return tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
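
Here parse_proto has to match the features written above. A minimal sketch, assuming the variable-length flattened float features produced by array_to_tfrecords (the exact shapes are placeholders for your data):

def parse_proto(example_proto):
  # Parse the variable-length float lists written by array_to_tfrecords;
  # reshape afterwards if your model needs the original array shapes.
  features = {
    'X': tf.FixedLenSequenceFeature([], tf.float32, allow_missing=True),
    'y': tf.FixedLenSequenceFeature([], tf.float32, allow_missing=True)
  }
  parsed = tf.parse_single_example(example_proto, features)
  return parsed['X'], parsed['y']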

This should ensure your data is thoroughly shuffled and give better results.

Additionally, I believe you would benefit from a little data preprocessing. For starters, try converting all the files in your dataset into a standardized WAVE format and then saving that data to TFRecords. Currently you are converting them into WAVE and standardizing the sample rate with librosa, but that doesn't standardize the channels. Instead, try using a function like:

from pydub import AudioSegment

def convert(path):
    # open file (supports all ffmpeg-supported file types)
    audio = AudioSegment.from_file(path, path.split('.')[-1].lower())

    # mix down to mono
    audio = audio.set_channels(1)

    # resample to 44.1 kHz
    audio = audio.set_frame_rate(44100)

    # overwrite the original file in place with the standardized WAV data
    audio.export(path, format="wav")

Lastly, you might find that reading the sound files as floating points isn't in your best interests. You should consider trying something like:

import numpy as np
import scipy.io.wavfile as wave
import python_speech_features as psf

def getSpectrogram(path, winlen=0.025, winstep=0.01, NFFT=512):
    # open wav file
    (rate, sig) = wave.read(path)

    # split the signal into frames, using a rectangular window
    winfunc = lambda x: np.ones((x,))
    frames = psf.sigproc.framesig(sig, winlen*rate, winstep*rate, winfunc)

    # magnitude spectrogram, rotated so that time runs along the x-axis
    magspec = np.rot90(psf.sigproc.magspec(frames, NFFT))

    # noise reduction (mean subtraction)
    magspec -= magspec.mean(axis=0)

    # normalize values between 0 and 1
    magspec -= magspec.min(axis=0)
    magspec /= magspec.max(axis=0)

    # show spec dimensions
    print(magspec.shape)

    return magspec

Then apply the functions like so:

# convert the file if you need to
convert(filepath)

# get the spectrogram
spec = getSpectrogram(filepath)
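
From there, each spectrogram can go straight into the TFRecord writer from above; a rough sketch, where y stands in for whatever training target belongs to the file:

# write the preprocessed example to disk for the input pipeline above
array_to_tfrecords(spec, y, filepath + ".tfrecord")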

This will turn the data from the WAVE files into spectrogram images, which you can then handle the same way you would any image classification problem.

Philip DiSarro