7

I have an audio dataset, and each recording has a different length. There are some events in these audios that I want to train on and test for, but the events occur at random positions, and since the lengths also differ, it is really hard to build a machine learning system using this dataset. I thought about fixing a default length and building a multilayer NN; however, the lengths of the events also differ. Then I thought about using a CNN, the way it is used to recognise patterns or multiple people in an image. The problem with that one is that I really struggle to understand the audio files.

So, my question: can anyone give me some tips on building a machine learning system that classifies different types of defined events by training itself on a dataset where these events occur at random positions (one recording contains more than one event, and they differ from each other) and each recording has a different length?

I would really appreciate any help.

Faruk

2 Answers

8

First, you need to annotate your events in the sound streams, i.e. specify bounds and labels for them.

Then, convert your sounds into sequences of feature vectors using signal framing. Typical choices are MFCCs or log-mel filterbank features (the latter corresponds to a spectrogram of the sound). This turns each sound into a sequence of fixed-size feature vectors that can be fed into a classifier. See this for a better explanation.
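For instance, with librosa (which comes up in the comments below), the framing step could look roughly like this; the file name, sample rate, and hop length are placeholder assumptions, not values from the answer:

```python
import librosa

# Minimal feature-extraction sketch; file name, sample rate and hop length
# below are illustrative assumptions.
y, sr = librosa.load("recording.wav", sr=16000, mono=True)

hop_length = 512  # frame step in samples
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
print(mfcc.shape)  # (n_mfcc, n_frames): one fixed-size feature vector per frame

# Annotated event boundaries in seconds map to frame indices via the sample
# rate and hop length (librosa.time_to_frames does the same conversion).
start_frame = int(4.15 * sr / hop_length)
end_frame = int(5.14 * sr / hop_length)
```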

Since the sounds of interest typically last longer than a single analysis frame, you probably need to stack several contiguous feature vectors using a sliding window and use these stacked frames as the input to your NN.
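A rough sketch of that stacking in plain NumPy (a hop of one frame and a window length chosen arbitrarily for illustration):

```python
import numpy as np

def stack_frames(features, context=5):
    """Stack `context` consecutive feature vectors into one input vector.

    `features` has shape (n_feats, n_frames) as produced by the framing step;
    the window length `context` is an arbitrary example value.
    """
    n_feats, n_frames = features.shape
    windows = []
    for start in range(n_frames - context + 1):
        # Flatten the (n_feats, context) block into one fixed-size vector.
        windows.append(features[:, start:start + context].reshape(-1))
    return np.array(windows)  # shape: (n_windows, n_feats * context)
```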

Now you have a) input data and b) annotations for each analysis window. So you can try to train a DNN, a CNN, or an RNN to predict a sound class for each window. This task is known as spotting. I suggest you read Sainath, T. N., & Parada, C. (2015). Convolutional Neural Networks for Small-footprint Keyword Spotting. In Proceedings of INTERSPEECH (pp. 1478–1482), and follow its references for more details.
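As an illustration only (in Keras, which the asker uses in the comments; this is not the architecture from the cited paper, and the layer sizes and class count are made-up assumptions), a per-window classifier could be wired up like this:

```python
from keras.models import Sequential
from keras.layers import Dense

# Placeholder sizes: 13 MFCCs, 5 stacked frames, 4 event classes.
n_feats, context, n_classes = 13, 5, 4

model = Sequential([
    Dense(128, activation="relu", input_shape=(n_feats * context,)),
    Dense(128, activation="relu"),
    Dense(n_classes, activation="softmax"),  # distribution over sound classes
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

# X: (n_windows, n_feats * context) stacked frames, y: integer class per window
# model.fit(X, y, epochs=10, batch_size=32)
```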

Dmytro Prylipko
  • Do you have any Python snippets for sampling and windowing a `.wav` file? The audio files in the dataset are mono, and the data is a vector whose length is the frame count, containing double numbers. Will I be able to get a spectral analysis of a mono wave? – Faruk Dec 09 '16 at 11:15
  • I did some more research on the internet, and I think an `LSTM` is the best option for training this kind of data. But I still have a problem: I understand what you explained above, but I have no idea how to fix the size of the audio. – Faruk Dec 09 '16 at 11:42
  • 3
    If you want to write code in Python, use librosa. It comes with numerous examples. But take into account that you need some basic knowledge of signal processing anyway. Learn the Fourier transform, windowing functions, and the spectrum. – Dmytro Prylipko Dec 09 '16 at 11:47
  • 1
    You do not need to fix the size of the audio. After feature extraction, your 1-D signal will be converted into a 2-D matrix of feature vectors, where the first dimension has a fixed size (# of feats) and the second dimension corresponds to the length of the input sound. Concerning LSTMs, I would start with window-based classification using a feed-forward network. You will get a probability distribution over time, which you can easily post-process to find where a specific sound is detected. – Dmytro Prylipko Dec 09 '16 at 11:52
  • Then, after feature extraction I will have an (n_features, n_frames) matrix, and each column is one input vector, as I understand it. But I am confused about LSTMs again. What do you mean by distribution? – Faruk Dec 09 '16 at 12:03
  • 2
    A NN usually produces a kind of probability distribution over classes (using a softmax layer). Having applied it to every frame, you will get a probability distribution over sound classes over time. See this paper for a two-class example (using MFCC+SVM): http://mazsola.iit.uni-miskolc.hu/~czap/letoltes/IS14/IS2014/PDF/AUTHOR/IS140919.PDF – Dmytro Prylipko Dec 09 '16 at 12:55
  • Hi again, the labels are provided as XML files, and inside each one there are start and end seconds of the events. The question is (I use librosa): `librosa.feature.mfcc` returns a matrix with shape `n_mfcc` by `time`. For a sample with a length of 186 seconds, `time` is 8019. How can I transform those start and end seconds into the form of `time`? – Faruk Dec 27 '16 at 15:11
  • That's a bit tricky. The `time` axis of the MFCC matrix is indeed the frame number. To convert a frame number into a time value, you first need to calculate the sample number that corresponds to the beginning of the frame: `sample = frame_num * hop_length`. Given the sample rate, `time (s) = sample / sample_rate`. `hop_length` is a parameter of librosa. See https://github.com/librosa/librosa/blob/master/librosa/core/time_frequency.py#L119 – Dmytro Prylipko Dec 27 '16 at 16:55
  • I think you misunderstood me. I need `time_to_sample`, because MFCC returns a matrix in the form (n_mfcc, sample), so I need to get samples. For instance, in the `xml`s there is a start second (4.1512) and an end second (5.14213). I need to get the sample numbers that cover this span. – Faruk Dec 27 '16 at 21:42
  • If you sample a sound at 16kHz, it means you have 16000 samples per second. Thus, `sample = time x sample_rate` – Dmytro Prylipko Dec 28 '16 at 09:58
  • Yeah, I figured it out, thank you very much. One last question: the data is around 471 MB in `np.array` form, and I don't have that powerful a computer. Even though I am using `Keras` with the `Theano` backend to run the calculations on the `gpu` (NVidia GTX 670M), I am not sure how long it will take to train the model. Is there any cloud computing service or anything else that you can suggest? – Faruk Dec 28 '16 at 10:15
  • 2
    471 MB is nothing. But you don't even need to pack everything into a single array. Neural networks are trained on batches of data that are read sequentially. If you do everything correctly, your GPU is more than enough. – Dmytro Prylipko Dec 28 '16 at 10:46
  • I am trying to train the model, but the loss is about 10.8 and not decreasing, and the accuracy is about 0.08 and not increasing per epoch. I used `mfcc` for feature extraction, and y_train is a vector containing 0, 1, 2, 3 for each event. I didn't use a one-hot encoder. I also tried a validation set, but it isn't good either. `sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True) model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])` is how I compile. What is wrong, because it is not learning at all? – Faruk Dec 28 '16 at 12:35
  • I guess this thread is getting too long. Try opening a new question. – Dmytro Prylipko Dec 28 '16 at 12:57
  • Never mind, I had built the system the wrong way. I made some changes, like setting `loss` in `compile` to `sparse_categorical_crossentropy`, and the accuracy after 1 epoch is 80%. At least Keras says so :). Anyway, I think it is all about optimizing it from now on. – Faruk Dec 28 '16 at 14:07
  • Hey @Faruk, would it be possible for you to share any snippets? I am trying to do something similar. It would be great if you could tell me what worked – Singhal2 Jun 16 '20 at 04:57
  • Hey @Singhal2, I am sorry, but it was so long ago that I am not sure I can find the code. Sorry I could not help. – Faruk Jun 18 '20 at 10:00
3

You can use a recurrent neural network (RNN).

https://www.tensorflow.org/versions/r0.12/tutorials/recurrent/index.html

The input data is a sequence, and you can put a label on every sample of the time series.

For example, an LSTM (a kind of RNN) is available in libraries like TensorFlow.
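The linked tutorial is TensorFlow-specific; as a rough sketch of the same idea in Keras (the layer width, feature size, and class count are arbitrary assumptions, not part of this answer), a sequence model that emits one label per frame might look like:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

# Placeholder sizes: 13 features per frame, 4 event classes.
n_feats, n_classes = 13, 4

model = Sequential([
    # Sequence length is left as None, so inputs of different lengths are accepted.
    LSTM(64, return_sequences=True, input_shape=(None, n_feats)),
    TimeDistributed(Dense(n_classes, activation="softmax")),  # one label per frame
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])

# x: (batch, timesteps, n_feats) framed features; y: one integer label per frame
x = np.random.randn(2, 100, n_feats).astype("float32")
y = np.random.randint(0, n_classes, size=(2, 100, 1))
model.fit(x, y, epochs=1)
```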

Rob
  • I did some research about RNNs and LSTMs, and it seems I can use that. However, will I be able to build an RNN which can accept input data of different shapes? Sorry for asking, I am a newbie in everything machine learning. – Faruk Dec 09 '16 at 09:30
  • 1
    In RNNs you work with time series, for example audio signals; every audio signal can have a different size, and every audio signal has a different label at each time unit. The RNN iterates over the whole audio signal, producing a different output at each time unit. – Rob Dec 09 '16 at 15:05
  • Yes, you divide every signal into small frames, then you extract features from every frame (or use the time signal directly). Every frame has a label. The RNN iterates and classifies every frame as a function of the current and previous frames. – Rob Dec 09 '16 at 20:07