My architecture uses a CNN as the feature extractor, followed by dense layers that map the extracted features onto my final classes, to classify audio clips. The CNN takes as input the STFT of a 1 s audio clip, and the resulting features are passed through the feed-forward network to determine the class. The classifier is working very well. But...
When I run it over a longer audio file, one 1-second frame at a time with a given time step, I am not sure how to aggregate the per-frame classifier outputs into a single decision. How should I build my architecture to account for this time-varying input?
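For context, my inference loop looks roughly like the sketch below (the names `model`, `compute_stft`, and the exact sample rate are placeholders for my actual setup, not the real code):

```python
import numpy as np

WINDOW_S = 1.0      # classifier input length in seconds
HOP_S = 0.02        # 20 ms time step between consecutive windows
SAMPLE_RATE = 16000  # assumed sample rate

def frame_predictions(audio, model, compute_stft):
    """Slide a 1 s window over the waveform and classify each window independently."""
    win = int(WINDOW_S * SAMPLE_RATE)
    hop = int(HOP_S * SAMPLE_RATE)
    outputs = []
    for start in range(0, len(audio) - win + 1, hop):
        clip = audio[start:start + win]
        spec = compute_stft(clip)                   # STFT of the 1 s clip
        prob = model.predict(spec[np.newaxis])[0]   # per-window class score
        outputs.append(int(prob > 0.5))             # 1/0 decision per window
    return outputs
```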
Let me illustrate with an example. Imagine the output of my classifier for each time frame is 1/0, I am taking 20 ms time steps, and I want to detect the word HELLO:
Audio:  "my name is X and I say HELLO world!"
Output: "000000000000000000000000000111111111000000000"
My classifier correctly detected the utterance of the word "HELLO" in every time step I took around the word, i.e. in the windows that overlap it before, during, and after it was uttered.
How can I combine all the 1's into one single decision? Should I have another MLP that takes the decision from each time step as its input? And how should I handle this at training time?
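The only thing I can think of so far is a hand-tuned post-processing rule that merges runs of consecutive 1's into a single detection event, along the lines of the sketch below (the `min_run` threshold is arbitrary). I am not sure whether this kind of rule is the right approach or whether the aggregation should be learned instead.

```python
def merge_runs(frame_outputs, min_run=3):
    """Collapse runs of consecutive 1's into single detection events.

    Returns a list of (start_frame, end_frame) pairs; `min_run` is an
    arbitrary threshold to ignore isolated spurious 1's.
    """
    events = []
    run_start = None
    for i, y in enumerate(frame_outputs + [0]):  # sentinel 0 closes a trailing run
        if y == 1 and run_start is None:
            run_start = i
        elif y == 0 and run_start is not None:
            if i - run_start >= min_run:
                events.append((run_start, i - 1))
            run_start = None
    return events

# For the example output above this returns a single (start, end) pair
# covering the block of 1's around "HELLO":
# merge_runs([int(c) for c in "000000000000000000000000000111111111000000000"])
```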