My architecture uses a CNN as the feature extractor, followed by dense layers that map the extracted features onto my final classes, to classify audio clips. The CNN takes as input the STFT of a 1 s audio clip, and the resulting features are passed through the feed-forward network to determine the class. The classifier is working very well. But...
When I run it over a longer audio file, one 1-second frame at a time with a given time step, I am not sure how to aggregate the per-frame classifier outputs into a single decision. How should I build my architecture to account for this time-varying input?
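For context, my inference loop looks roughly like the sketch below (the names `model`, `compute_stft`, and the exact sample rate are placeholders for my actual setup, not the real code):

```python
import numpy as np

WINDOW_S = 1.0      # classifier input length in seconds
HOP_S = 0.02        # 20 ms time step between consecutive windows
SAMPLE_RATE = 16000  # assumed sample rate

def frame_predictions(audio, model, compute_stft):
    """Slide a 1 s window over the waveform and classify each window independently."""
    win = int(WINDOW_S * SAMPLE_RATE)
    hop = int(HOP_S * SAMPLE_RATE)
    outputs = []
    for start in range(0, len(audio) - win + 1, hop):
        clip = audio[start:start + win]
        spec = compute_stft(clip)                   # STFT of the 1 s clip
        prob = model.predict(spec[np.newaxis])[0]   # per-window class score
        outputs.append(int(prob > 0.5))             # 1/0 decision per window
    return outputs
```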
Let me illustrate with an example. Imagine the output of my classifier for each time frame is 1/0, I am taking 20 ms time steps, and I want to detect the word HELLO:
Audio:  "my name is X and I say HELLO world!"
Output: "000000000000000000000000000111111111000000000"
My classifier correctly detected the utterance of the word "HELLO" in every time step I took around the word, i.e. in the windows that overlap it before, during, and after it was uttered.
How can I combine all the 1's into one single decision? Should I have another MLP that takes the decision from each time step as its input? And how should I handle this at training time?
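The only thing I can think of so far is a hand-tuned post-processing rule that merges runs of consecutive 1's into a single detection event, along the lines of the sketch below (the `min_run` threshold is arbitrary). I am not sure whether this kind of rule is the right approach or whether the aggregation should be learned instead.

```python
def merge_runs(frame_outputs, min_run=3):
    """Collapse runs of consecutive 1's into single detection events.

    Returns a list of (start_frame, end_frame) pairs; `min_run` is an
    arbitrary threshold to ignore isolated spurious 1's.
    """
    events = []
    run_start = None
    for i, y in enumerate(frame_outputs + [0]):  # sentinel 0 closes a trailing run
        if y == 1 and run_start is None:
            run_start = i
        elif y == 0 and run_start is not None:
            if i - run_start >= min_run:
                events.append((run_start, i - 1))
            run_start = None
    return events

# For the example output above this returns a single (start, end) pair
# covering the block of 1's around "HELLO":
# merge_runs([int(c) for c in "000000000000000000000000000111111111000000000"])
```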