Currently, I am parsing WAV files and storing the samples in std::vector<int16_t> sample. Now I want to apply VAD (Voice Activity Detection) to this data to find the "regions" of voice, and more specifically the start and end of words.

The parsed WAV files are 16 kHz, 16-bit PCM, mono. My code is in C++.

I have searched a lot but could not find proper documentation for WebRTC's VAD functions.

From what I have found, the function that I need to use is WebRtcVad_Process(). Its prototype is shown below:

int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame,
                      size_t frame_length)

From what I found here : https://stackoverflow.com/a/36826564/6487831

Each frame of audio that you send to the VAD must be 10, 20 or 30 milliseconds long. Here's an outline of an example that assumes audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:

int is_voiced = WebRtcVad_Process(vad, 16000, audio_frame, 160);

It makes sense:

1 sample = 2 bytes = 16 bits
Sample rate = 16000 samples/sec = 16 samples/ms
For 10 ms, number of samples = 160
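
The same arithmetic as a quick compile-time sanity check for the three frame sizes the WebRTC VAD accepts (the constant and function names below are just illustrative, not part of any API):

#include <cstddef>

// 16000 samples/sec = 16 samples/ms; each int16_t sample is 2 bytes.
constexpr int kSampleRateHz = 16000;

constexpr std::size_t SamplesPerFrame(int frame_ms) {
    return static_cast<std::size_t>(kSampleRateHz / 1000) * frame_ms;
}

static_assert(SamplesPerFrame(10) == 160, "10 ms -> 160 samples (320 bytes)");
static_assert(SamplesPerFrame(20) == 320, "20 ms -> 320 samples (640 bytes)");
static_assert(SamplesPerFrame(30) == 480, "30 ms -> 480 samples (960 bytes)");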

So, based on that I have implemented this :

const int16_t * temp = sample.data();
for(int i = 0, ms = 0; i < sample.size(); i += 160, ms++)
{
    int isActive = WebRtcVad_Process(vad, 16000, temp, 160); //10 ms window
    std::cout<<ms<<" ms : "<<isActive<<std::endl;
    temp = temp + 160; // processed 160 samples
}

Now, I am not really sure if this is correct, and I am also unsure whether it gives me the correct output.

So,

  • Is it possible to use the samples parsed directly from the WAV files, or do they need some processing?
  • Am I looking at the correct function to do the job?
  • How to use the function to properly perform VAD on the audio stream?
  • Is it possible to distinguish between the spoken words?
  • What is the best way to check if the output I am getting is correct?
  • If not, what is the best way to do this task?
Saurabh Shrivastava

1 Answer

I'll start by saying that no, I don't think you will be able to segment an utterance into individual words using VAD. From the article on speech segmentation in Wikipedia:

One might expect that the inter-word spaces used by many written languages like English or Spanish would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.

That said, I'll try to answer your other questions.

  1. You need to decode the WAV file, which could be compressed, into raw PCM audio data before running VAD. See e.g. Reading and processing WAV file data in C/C++. Alternatively, you could use something like sox to convert the WAV file to raw audio before running your code. This command will convert a WAV file of any format to 16 kHz, 16-bit PCM in the format that the WebRTC VAD expects:

    sox my_file.wav -r 16000 -b 16 -c 1 -e signed-integer -B my_file.raw
    
  2. It looks like you are using the right function. To be more specific, you should be doing this:

    #include "webrtc/common_audio/vad/include/webrtc_vad.h"
    // ...
    VadInst *vad;
    WebRtcVad_Create(&vad);
    WebRtcVad_Init(vad);
    const int16_t * temp = sample.data();
    // Stop before a final partial frame; WebRtcVad_Process needs full 160-sample frames.
    for(size_t i = 0, ms = 0; i + 160 <= sample.size(); i += 160, ms += 10)
    {
      int isActive = WebRtcVad_Process(vad, 16000, temp, 160); //10 ms window
      std::cout << ms << " ms : " << isActive << std::endl;
      temp = temp + 160; // processed 160 samples (320 bytes)
    }
    
  3. To see if it's working, you can run known files and see if you get the results you expect. For example, you could start by processing silence and confirm that you never (or rarely--this algorithm is not perfect) see a voiced result come back from WebRtcVad_Process (a minimal sketch of such a check appears after this list). Then you could try a file that is all silence except for one short utterance in the middle, etc. If you want to compare to an existing test, the py-webrtcvad module has a unit test that does this; see the test_process_file function.

  4. To do word-level segmentation, you will probably need to find a speech recognition library that does it or gives you access to the information that you need to do it. E.g. this thread on the Kaldi mailing list seems to talk about how to segment by words.
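
Here is a minimal sketch of the silence check from point 3, assuming the same WebRTC checkout and header path as in the snippet above (the VAD is not perfect, so a stray voiced frame is not necessarily a failure):

    #include <cstdint>
    #include <iostream>
    #include <vector>
    #include "webrtc/common_audio/vad/include/webrtc_vad.h"

    int main() {
      VadInst *vad;
      WebRtcVad_Create(&vad);
      WebRtcVad_Init(vad);

      // One 10 ms frame (160 samples) of digital silence at 16 kHz.
      std::vector<int16_t> silence(160, 0);

      // Feed 1 second of silence and count how often the VAD says "voiced".
      int voiced_frames = 0;
      for (int frame = 0; frame < 100; ++frame) {
        if (WebRtcVad_Process(vad, 16000, silence.data(), 160) == 1)
          ++voiced_frames;
      }
      std::cout << "voiced frames out of 100: " << voiced_frames << std::endl;

      WebRtcVad_Free(vad);
      return 0;
    }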

John Wiseman
  • Yes, I am already decoding (parsing) the wave file & the `sample` is the vector containing those samples. (https://github.com/saurabhshri/CCAligner/blob/development/src/lib_ccaligner/read_wav_file.cpp#L216) I was unsure about the `audio_frame` in the function prototype and was dubious if doing `temp = temp + 160;` is the right way to do it. Also, the way I should interpret the output is : "x - y ms : 0 // this doesn't have speech" "x - y ms : 1 // this section has speech", right? I will look into something to distinguish between the words. I will try your suggestion for testing now. – Saurabh Shrivastava Jun 09 '17 at 18:07
  • 1
    Yes, `WebRtcVad_Process` is expecting `const int16_t*` as the audio frame, so you're doing the right thing. And while you're processing 320 bytes at a time, `WebRtcVad_Process` expects the number of _samples_, and each of your samples is 2 bytes, so you have 160 samples. Similarly, adding 1 to an `int16_t` pointer will advance it by 2 bytes, so adding 160 is correct. I'll edit my answer to compute the millisecond timestamp correctly. – John Wiseman Jun 09 '17 at 19:40
  • I am really thankful for all your help. Perfectly answers all my questions. Marked as accepted! :) One more thing, how do you suggest I can improve the accuracy? What according to you is a good "frame_length" and "aggressiveness" ? – Saurabh Shrivastava Jun 09 '17 at 19:53
  • 3
    I usually use an aggressiveness of 3 (which means the VAD has to be _really sure_ something is speech before it will categorize it as such) and 30 ms frames. I think you might want to change the aggressiveness based on the background noise level--with less noise, you could make it less aggressive. I don't really know the tradeoffs of frame length--I thought maybe by looking at a larger piece of audio the VAD would have more information to make a classification, but that's just a guess. See https://stackoverflow.com/a/36826188/122762 for a sliding window technique that might help you. – John Wiseman Jun 09 '17 at 20:37
  • That sliding window approach is very close to something I too was thinking. Thanks a lot for linking that! :) – Saurabh Shrivastava Jun 11 '17 at 15:28
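
Following up on the aggressiveness and frame-length discussion in the comments above, here is a minimal sketch that sets mode 3 via WebRtcVad_set_mode and uses 30 ms frames (480 samples at 16 kHz); the run_vad function name is just illustrative:

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>
    #include "webrtc/common_audio/vad/include/webrtc_vad.h"

    void run_vad(const std::vector<int16_t> &sample) {
      VadInst *vad;
      WebRtcVad_Create(&vad);
      WebRtcVad_Init(vad);
      WebRtcVad_set_mode(vad, 3);            // 0 = least aggressive, 3 = most aggressive

      const std::size_t kFrameSamples = 480; // 30 ms at 16 kHz (16 samples per ms)
      for (std::size_t i = 0; i + kFrameSamples <= sample.size(); i += kFrameSamples) {
        int isActive = WebRtcVad_Process(vad, 16000, sample.data() + i, kFrameSamples);
        std::cout << (i / 16) << " ms : " << isActive << std::endl;
      }

      WebRtcVad_Free(vad);
    }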