Google's WebRTC VAD algorithm (esp. "aggressiveness")

Question

I know Google's WebRTC VAD algorithm uses a Gaussian Mixture Model (GMM), but my math knowledge is weak, so I don't really understand what that means. Is it correct to say it's a statistically-based machine learning model, and for the case of VAD, one that has been trained to recognize speech vs. noise?

I'm writing a paper, and I've created a script that makes use of the API in order to distinguish voice from noise. It works, but I need to explain in my paper at a very basic level the mechanism it uses to make the decision.

Most pressingly, I need to know to some degree, what the "aggressiveness" setting does with regard to the algorithm. Is it literally just stipulating a confidence threshold? Does it have any acoustic repercussions?

Update:

My ultra-basic understanding is this: Google probably trained their model on a large set of pre-labeled "noise" and "speech" samples and stored the features of each; it then takes an unknown sample and checks whether it is more like the noise data or the speech data. I don't know what the features are measurements of, but I would assume that at least pitch and amplitude are measured.

It uses the GMM to calculate the probability that it belongs to one population or the other.

Aggressiveness likely sets the thresholds it uses for making a determination, but I don't exactly know how that part works.

The relevant code is here: https://chromium.googlesource.com/external/webrtc/+/refs/heads/master/common_audio/vad/vad_core.c

The "aggressiveness" setting determines the following constants (I show mode 0 and 3 for comparison):

// Constants used in WebRtcVad_set_mode_core().
//
// Thresholds for different frame lengths (10 ms, 20 ms and 30 ms).
//
// Mode 0, Quality.
static const int16_t kOverHangMax1Q[3] = { 8, 4, 3 };
static const int16_t kOverHangMax2Q[3] = { 14, 7, 5 };
static const int16_t kLocalThresholdQ[3] = { 24, 21, 24 };
static const int16_t kGlobalThresholdQ[3] = { 57, 48, 57 };

// Mode 3, Very aggressive.
static const int16_t kOverHangMax1VAG[3] = { 6, 3, 2 };
static const int16_t kOverHangMax2VAG[3] = { 9, 5, 3 };
static const int16_t kLocalThresholdVAG[3] = { 94, 94, 94 };
static const int16_t kGlobalThresholdVAG[3] = { 1100, 1050, 1100 };

I don't quite understand how overhang and local/global threshold come into play. Are these strictly statistical parameters?

If you are to explain what it means, then you have to learn what GMM does - we can't do that for you. As for the "aggressiveness", what did you find in the source code? Any particular piece that you can't understand? — Lukasz Tracewski, Apr 13 '19 at 13:17
Hi Lukasz, thanks for your response. Here's the source file I am looking at: https://github.com/wiseman/py-webrtcvad/blob/master/cbits/webrtc/common_audio/vad/vad_core.c. I believe the function `GmmProbability` is ultimately responsible for deciding whether a given frame of audio is speech or noise. Meanwhile, `WebRtcVad_set_mode_core` handles the "aggressiveness" parameter and assigns various properties based on that value. What I am unable to understand is where or how those values are used in making the VAD decision. I plan to update my question tomorrow with more detail — Tyler Peckenpaugh PhD, Apr 14 '19 at 20:22

ruoho ruotsi · Accepted Answer · 2020-05-16T18:28:38.000

Tracing the code, you'll see that the four preset values you listed above, which change with "aggressiveness" (kOverHangMax{1,2}*, kLocalThreshold*, kGlobalThreshold*), are copied into four internal arrays. The aggressiveness mode selects which set of constants is copied in; each array then holds one entry per frame length:

self->over_hang_max_1[], self->over_hang_max_2[], self->individual[], self->total[]
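
A minimal sketch of that mapping, assuming a simplified state struct: `VadThresholds` and `set_mode_sketch` are hypothetical names, and only the mode 0 and mode 3 constants quoted in the question are covered, so other modes are rejected here.

```c
#include <stdint.h>
#include <string.h>

enum { kNumFrameLengths = 3 };  /* one slot each for 10 ms, 20 ms, 30 ms */

/* Hypothetical stand-in for the relevant fields of the VAD state. */
typedef struct {
    int16_t over_hang_max_1[kNumFrameLengths];
    int16_t over_hang_max_2[kNumFrameLengths];
    int16_t individual[kNumFrameLengths];
    int16_t total[kNumFrameLengths];
} VadThresholds;

/* Constants as quoted in the question (modes 0 and 3 only). */
static const int16_t kOverHangMax1Q[3] = { 8, 4, 3 };
static const int16_t kOverHangMax2Q[3] = { 14, 7, 5 };
static const int16_t kLocalThresholdQ[3] = { 24, 21, 24 };
static const int16_t kGlobalThresholdQ[3] = { 57, 48, 57 };
static const int16_t kOverHangMax1VAG[3] = { 6, 3, 2 };
static const int16_t kOverHangMax2VAG[3] = { 9, 5, 3 };
static const int16_t kLocalThresholdVAG[3] = { 94, 94, 94 };
static const int16_t kGlobalThresholdVAG[3] = { 1100, 1050, 1100 };

/* Copies the constants for `mode` into the state.
 * Returns 0 on success, -1 for modes this sketch does not cover. */
int set_mode_sketch(VadThresholds *t, int mode) {
    switch (mode) {
    case 0:  /* Quality */
        memcpy(t->over_hang_max_1, kOverHangMax1Q, sizeof(t->over_hang_max_1));
        memcpy(t->over_hang_max_2, kOverHangMax2Q, sizeof(t->over_hang_max_2));
        memcpy(t->individual, kLocalThresholdQ, sizeof(t->individual));
        memcpy(t->total, kGlobalThresholdQ, sizeof(t->total));
        return 0;
    case 3:  /* Very aggressive */
        memcpy(t->over_hang_max_1, kOverHangMax1VAG, sizeof(t->over_hang_max_1));
        memcpy(t->over_hang_max_2, kOverHangMax2VAG, sizeof(t->over_hang_max_2));
        memcpy(t->individual, kLocalThresholdVAG, sizeof(t->individual));
        memcpy(t->total, kGlobalThresholdVAG, sizeof(t->total));
        return 0;
    default:
        return -1;
    }
}
```

After `set_mode_sketch(&t, 3)`, for example, `t.total[1]` holds 1050, the global threshold for 20 ms frames in the very aggressive mode.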

Looking further at line 158 in vad_core.c, we see that the different values are used based on the frame length. The frame_length is the "atom" or "chunk" of audio under analysis:

// Set various thresholds based on frame lengths (80, 160 or 240 samples).
  if (frame_length == 80) {
    overhead1 = self->over_hang_max_1[0];
    overhead2 = self->over_hang_max_2[0];
    individualTest = self->individual[0];
    totalTest = self->total[0];
  } else if (frame_length == 160) {
    overhead1 = self->over_hang_max_1[1];
    overhead2 = self->over_hang_max_2[1];
    individualTest = self->individual[1];
    totalTest = self->total[1];
  } else {
    overhead1 = self->over_hang_max_1[2];
    overhead2 = self->over_hang_max_2[2];
    individualTest = self->individual[2];
    totalTest = self->total[2];
  }

Intuition

So the bigger the chunk of audio (240 samples), the more "aggressive" the algorithm, while the smaller 80-sample frames are "less aggressive". But why is this? What is the intuition?

The calling code (which uses vad_core) provides it with chunks of audio that are frame_length samples long. So if the audio file you're running VAD on is 10 minutes long, a sliding window over that audio will generate frame_length-sized chunks and pass them to this code.

With audio at an 8000 Hz sample rate, a small frame_length (80 samples) gives a fine-grained resolution of 10 ms, so the VAD signal will be very precise: changes will be tracked accurately and the VAD estimate will be "reasonable". When frame_length is large (240 samples, or 30 ms), the resolution is coarser, and the VAD signal will be less in tune with minor (<30 ms) changes in the signal's voice activity, and thus "less cautious".
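
The arithmetic behind those durations, as a small sketch. `frame_duration_ms` and `threshold_index` are hypothetical helper names; the index mapping mirrors the if/else chain quoted above.

```c
/* Duration in milliseconds of a frame of `frame_length` samples. */
int frame_duration_ms(int frame_length, int sample_rate_hz) {
    return frame_length * 1000 / sample_rate_hz;
}

/* Index into the per-frame-length threshold arrays:
 * 80 samples -> 0, 160 samples -> 1, anything else -> 2. */
int threshold_index(int frame_length) {
    if (frame_length == 80) {
        return 0;
    } else if (frame_length == 160) {
        return 1;
    }
    return 2;
}
```

At 8000 Hz, frames of 80, 160 and 240 samples therefore last 10 ms, 20 ms and 30 ms, and select entries 0, 1 and 2 of each threshold array.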

So rather than aggressiveness, I'd rather talk about how "cautiously" or "assertively" it tracks the underlying voice-signal it is estimating.

I hope that helps reason about what it's doing. As for the values themselves, they are just algorithmic details that vary due to the different-sized audio frame.
