
I'm currently developing a keyword-spotting system that recognizes digits from 0 to 9 using deep neural networks. I have a dataset of people saying the numbers (namely the TIDIGITS dataset, collected at Texas Instruments, Inc.), but the data is not ready to be fed into a neural network: not all the audio files have the same length, and some of them contain several digits spoken in sequence, like "one two three".

Can anyone tell me how I would transform these wav files into one-second wav files, each containing the sound of a single digit? Is there any way to do this automatically? Preparing the audio files individually would be far too time-consuming.

Thank you in advance!

UrmLmn
  • have you perused: A free audio dataset of spoken digits. Think MNIST for audio ... https://github.com/Jakobovski/free-spoken-digit-dataset – Scott Stensland Apr 29 '18 at 14:29
  • Hello @ScottStensland! Thank you for your reply! The dataset I'm using for this project is a requirement, so I can't really change that :/ – UrmLmn Apr 29 '18 at 14:33
  • Related https://stackoverflow.com/questions/19419098/how-to-train-on-and-make-a-serialized-feature-vector-for-a-neural-network – Nikolay Shmyrev Aug 19 '18 at 07:55

2 Answers


It depends on what other datasets you have, but here is one approach: blindly cut one-second snippets out of your audio, then apply some judgement as to whether each snippet actually contains a single spoken digit.

For each input audio file, define a one-second window, pluck it out, and save it into its own file; then slide the window further into the audio file and pluck out the next snippet into its own file.

Since we want one-second clips and we do not know where in the source file the digits lie, once the first window snippet is saved, slide over by only a small step, say 200 ms, and pluck out the next window. So for each input audio file we create a succession of overlapping snippets, each with its starting point only 200 ms from the previous snippet's. The command-line tool ffmpeg makes this easy:

https://ffmpeg.org/ffmpeg.html

https://ffmpeg.org/ffmpeg-utils.html#time-duration-syntax

input_audio=audio_from_your_dataset.wav
output_audio=output/aaa
ffmpeg -i $input_audio -ss 0    -t 1 -acodec copy ${output_audio}.0.00.wav
ffmpeg -i $input_audio -ss 0.20 -t 1 -acodec copy ${output_audio}.0.20.wav
ffmpeg -i $input_audio -ss 0.40 -t 1 -acodec copy ${output_audio}.0.40.wav
ffmpeg -i $input_audio -ss 0.60 -t 1 -acodec copy ${output_audio}.0.60.wav
ffmpeg -i $input_audio -ss 0.80 -t 1 -acodec copy ${output_audio}.0.80.wav
ffmpeg -i $input_audio -ss 1.00 -t 1 -acodec copy ${output_audio}.1.00.wav  
ffmpeg -i $input_audio -ss 1.20 -t 1 -acodec copy ${output_audio}.1.20.wav

In the commands above, the -ss parameter defines the starting point of the snippet in seconds (so 0.60 starts 600 ms into the file) and the -t parameter defines the window length in seconds.

The output will be:

./output/aaa.0.00.wav
./output/aaa.0.20.wav
./output/aaa.0.40.wav
./output/aaa.0.60.wav
./output/aaa.0.80.wav
./output/aaa.1.00.wav   
./output/aaa.1.20.wav

Issue the above on the command line. It is not limited to wav; other codecs are fine too. You now have several one-second snippet files plucked from the same input audio. I would then wrap this process in a meta-process that varies the width of the window: nothing in stone says 1 second, so do all of the above for window widths varying from, say, 0.1 seconds to 1 second. This will geometrically explode the number of snippet files you generate. Bonus points if you add another outermost loop that varies the increment by which each window's starting point slides over, since the 200 ms step should also be a free variable. So your code should define three loops around the ffmpeg calls (to advance across the input file, to vary the window width, and to vary the slide step); a sketch follows.
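Here is a minimal bash sketch of those three loops. The file names, step sizes, and parameter ranges are placeholder assumptions you would adjust for your dataset, and the input duration is probed with ffprobe (which ships with ffmpeg):

#!/usr/bin/env bash
# Sketch: generate overlapping snippets while varying window width and slide step.
# File names and parameter ranges below are placeholders - tune for your data.

input_audio=audio_from_your_dataset.wav
outdir=output
mkdir -p "$outdir"

# total duration of the input in seconds, via ffprobe
duration=$(ffprobe -v error -show_entries format=duration \
                   -of default=noprint_wrappers=1:nokey=1 "$input_audio")

for width in 0.4 0.6 0.8 1.0; do           # vary window width
  for step in 0.1 0.2; do                  # vary slide increment
    start=0
    # advance the window across the file until it runs past the end
    while (( $(echo "$start + $width <= $duration" | bc -l) )); do
      out="$outdir/aaa.w${width}.s${step}.t${start}.wav"
      ffmpeg -y -loglevel error -i "$input_audio" -ss "$start" -t "$width" \
             -acodec copy "$out"
      start=$(echo "$start + $step" | bc -l)
    done
  done
done

Each output name encodes the window width, slide step, and start time, so you can later trace which settings produced a given snippet.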

ffmpeg is the industry-standard Swiss Army knife for audio/video manipulation (along with SoX). In addition to the command-line tools, ffmpeg is also a set of libraries callable from virtually any language (Python, Go, ...).

Now apply some ML to identify which of these snippets most closely match what a known spoken digit sounds like, and use that to decide which snippets to keep or discard.

Scott Stensland

I would split each wav at the areas of silence and trim the silence from the beginning and end. Then I'd run each clip through an FFT over different sections, with smaller ones at the beginning of the sound. Then I'd normalise the frequencies against the fundamental, and feed the results into the NN as a 3D array of volumes, frequencies and times.
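For the silence-splitting step, SoX can do this automatically with its silence effect. A minimal sketch, where the 1% threshold and the 0.3-second minimum silence length are assumptions you would tune per recording:

# Split input_digits.wav into one file per utterance using SoX's silence effect.
# The 1% threshold and 0.3 s minimum-silence length are guesses - tune per recording.
sox input_digits.wav digit.wav silence 1 0.1 1% 1 0.3 1% : newfile : restart

The first triple (1 0.1 1%) trims leading silence until 0.1 s of audio above the 1% threshold is heard; the second (1 0.3 1%) ends the segment once 0.3 s of silence follows; ": newfile : restart" writes each segment to a numbered file (digit001.wav, digit002.wav, ...) and starts over, giving one file per utterance.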

Ian McGarry