Currently, I am parsing WAV files and storing the samples in a std::vector<int16_t> sample. Now, I want to apply VAD (Voice Activity Detection) to this data to find the "regions" of voice, and more specifically the start and end of words.
The parsed WAV files are 16 kHz, 16-bit PCM, mono. My code is in C++.
I have searched a lot, but could not find proper documentation for WebRTC's VAD functions.
From what I have found, the function I need to use is WebRtcVad_Process(). Its prototype is written below:
int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame,
                      size_t frame_length);
From what I found here: https://stackoverflow.com/a/36826564/6487831
Each frame of audio that you send to the VAD must be 10, 20 or 30 milliseconds long. Here's an outline of an example that assumes audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:
int is_voiced = WebRtcVad_Process(vad, 16000, audio_frame, 160);
This makes sense:
1 sample = 2 bytes = 16 bits
Sample rate = 16000 samples/sec = 16 samples/ms
For 10 ms, number of samples = 160
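In code, that frame size works out like this (just a restatement of the arithmetic above; the constant names are my own):

#include <cstddef>

constexpr int kSampleRateHz = 16000;  // 16 kHz mono PCM
constexpr int kFrameDurationMs = 10;  // WebRTC VAD accepts 10, 20 or 30 ms frames
constexpr size_t kFrameLength = kSampleRateHz / 1000 * kFrameDurationMs;  // 160 samples = 320 bytes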
So, based on that, I have implemented this:
const int16_t* temp = sample.data();
// Stop before a final partial frame: WebRtcVad_Process() must get exactly 160 samples.
for (size_t i = 0, ms = 0; i + 160 <= sample.size(); i += 160, ms += 10)
{
    int isActive = WebRtcVad_Process(vad, 16000, temp, 160); // 10 ms frame; 1 = voiced, 0 = unvoiced, -1 = error
    std::cout << ms << " ms : " << isActive << std::endl;
    temp += 160; // advance past the 160 samples just processed
}
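For completeness, here is the setup and teardown I have around that loop. This assumes the revision of webrtc_vad.h where WebRtcVad_Create() returns the instance; older revisions take a VadInst** out-parameter instead:

#include <cstdint>
#include <iostream>
#include <vector>
#include "webrtc_vad.h"  // header location depends on the WebRTC checkout

int main()
{
    std::vector<int16_t> sample;  // filled by the WAV parser (omitted here)

    VadInst* vad = WebRtcVad_Create();
    if (vad == nullptr || WebRtcVad_Init(vad) != 0)
    {
        std::cerr << "Failed to create/initialize VAD" << std::endl;
        return 1;
    }
    // Aggressiveness mode 0..3: 0 flags the most frames as voiced,
    // 3 is the most aggressive at rejecting non-speech.
    if (WebRtcVad_set_mode(vad, 1) != 0)
    {
        std::cerr << "Failed to set VAD mode" << std::endl;
        return 1;
    }

    // ... the 160-sample frame loop shown above goes here ...

    WebRtcVad_Free(vad);
    return 0;
}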
Now, I am not really sure whether this is correct, or whether the output it produces is meaningful.
So,
- Is it possible to use the samples parsed directly from the WAV files, or do they need some processing first?
- Am I looking at the correct function to do the job?
- How to use the function to properly perform VAD on the audio stream?
- Is it possible to distinguish between individual spoken words?
- What is the best way to check if the output I am getting is correct?
- If not, what is the best way to do this task?
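One partial answer I did find to the "how to check" question: the library exposes WebRtcVad_ValidRateAndFrameLength(), which I now call once before the loop to confirm my parameters are supported (it returns 0 for a valid rate/frame-length combination):

// 16000 Hz with 160-sample (10 ms) frames should be a supported combination.
if (WebRtcVad_ValidRateAndFrameLength(16000, 160) != 0)
{
    std::cerr << "Unsupported sample rate / frame length" << std::endl;
}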