I'm currently trying to create and train a neural network to perform simple speech classification using MFCCs.
At the moment, I'm using 26 coefficients for each sample and five classes: five different words with varying numbers of syllables.
Each sample is 2 seconds long, but I am unsure how to handle speakers who pronounce a word very slowly or very quickly. For example, the word 'television' spoken within 1 second yields different coefficients than the same word spoken within 2 seconds.
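To make the issue concrete, here is a rough sketch of the frame arithmetic behind it. The 16 kHz sample rate, 25 ms analysis window, and 10 ms hop are illustrative assumptions, and `num_frames` is a hypothetical helper, not part of any library.

```python
def num_frames(duration_s, sample_rate=16000, win_ms=25, hop_ms=10):
    """Count the analysis frames that fit in a signal of the given duration.

    All default values are assumptions chosen for illustration only.
    """
    n_samples = int(duration_s * sample_rate)
    win = int(sample_rate * win_ms / 1000)   # window length in samples
    hop = int(sample_rate * hop_ms / 1000)   # hop length in samples
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


clip_frames = num_frames(2.0)   # the whole 2-second sample
fast_frames = num_frames(1.0)   # the word spoken in 1 second

print(clip_frames)  # 198
print(fast_frames)  # 98
```

Under these assumptions, a word spoken in 1 second fills only about half of the frames in the 2-second sample, while the same word spoken slowly stretches across nearly all of them, so the same phonemes land at different positions in the network's input.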
Any advice on how I can solve this problem would be much appreciated!