I know Google's WebRTC VAD algorithm uses a Gaussian Mixture Model (GMM), but my math knowledge is weak, so I don't really understand what that means. Is it correct to say that it's a statistical machine learning model and, in the case of VAD, one that has been trained to distinguish speech from noise?
I'm writing a paper, and I've created a script that uses the API to distinguish voice from noise. It works, but in the paper I need to explain, at a very basic level, the mechanism it uses to make the decision.
Most pressingly, I need to know, at least roughly, what the "aggressiveness" setting does to the algorithm. Is it literally just setting a confidence threshold? Does it have any acoustic repercussions?
Update:
My ultra-basic understanding is this: Google probably trained their model on a large set of pre-labeled "noise" and "speech" recordings and stored the features of each; it then takes an unknown sample and checks whether it is more like the noise data or the speech data. I don't know what the features measure, but I would assume that at least pitch and amplitude are included.
It uses the GMM to calculate the probability that the sample belongs to one population or the other.
Aggressiveness likely sets the thresholds it uses for making a determination, but I don't exactly know how that part works.
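For reference, this is the textbook form of what I think is going on, written for a single feature value $x$ (for example, the energy in one frequency band). The symbols $w$, $\mu$, $\sigma$, $K$ and the threshold $T$ are generic notation, not names taken from vad_core.c, and the real fixed-point code may differ in its details:

$$p(x \mid c) = \sum_{k=1}^{K} w_{c,k}\, \mathcal{N}\!\left(x;\ \mu_{c,k},\ \sigma_{c,k}^{2}\right), \qquad \sum_{k=1}^{K} w_{c,k} = 1, \qquad c \in \{\text{speech},\ \text{noise}\}$$

$$\Lambda(x) = \log \frac{p(x \mid \text{speech})}{p(x \mid \text{noise})}, \qquad \text{decide speech if } \Lambda(x) > T$$

If this is right, each class is modeled as a weighted sum of bell curves, and raising $T$ means the detector demands stronger evidence before it calls a frame speech.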
The relevant code is here: https://chromium.googlesource.com/external/webrtc/+/refs/heads/master/common_audio/vad/vad_core.c
The "aggressiveness" setting selects the following constants (I show modes 0 and 3 for comparison):
// Constants used in WebRtcVad_set_mode_core().
//
// Thresholds for different frame lengths (10 ms, 20 ms and 30 ms).
//
// Mode 0, Quality.
static const int16_t kOverHangMax1Q[3] = { 8, 4, 3 };
static const int16_t kOverHangMax2Q[3] = { 14, 7, 5 };
static const int16_t kLocalThresholdQ[3] = { 24, 21, 24 };
static const int16_t kGlobalThresholdQ[3] = { 57, 48, 57 };
// Mode 3, Very aggressive.
static const int16_t kOverHangMax1VAG[3] = { 6, 3, 2 };
static const int16_t kOverHangMax2VAG[3] = { 9, 5, 3 };
static const int16_t kLocalThresholdVAG[3] = { 94, 94, 94 };
static const int16_t kGlobalThresholdVAG[3] = { 1100, 1050, 1100 };
I don't quite understand how overhang and local/global threshold come into play. Are these strictly statistical parameters?
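To make the question concrete, here is a minimal sketch of how I imagine these constants might be used. It is a guess, not a copy of vad_core.c: the struct, the function name `sketch_vad_decide`, the number of bands, and the "long speech run" length are all names and values I made up, and the per-band scores are assumed to be log-likelihood ratios that are already computed.

```c
#include <stdint.h>

#define NUM_BANDS 6           /* assumed number of frequency bands */
#define LONG_SPEECH_FRAMES 6  /* assumed run length that counts as "long" speech */

/* Hypothetical state; field names are illustrative, not WebRTC's. */
typedef struct {
    int16_t overhang_max_1;   /* hangover after a short speech run, in frames */
    int16_t overhang_max_2;   /* hangover after a long speech run, in frames */
    int16_t local_threshold;  /* compared against each band's score */
    int16_t global_threshold; /* compared against the sum over all bands */
    int16_t overhang;         /* hangover frames still remaining */
    int16_t speech_run;       /* consecutive frames judged as speech so far */
} SketchVad;

/* Returns 1 if the frame is reported as speech, 0 otherwise. */
static int sketch_vad_decide(SketchVad *vad, const int16_t band_llr[NUM_BANDS]) {
    int32_t sum = 0;
    int is_speech = 0;

    for (int i = 0; i < NUM_BANDS; ++i) {
        /* Local test: one band alone looks clearly like speech. */
        if (band_llr[i] > vad->local_threshold) {
            is_speech = 1;
        }
        sum += band_llr[i];
    }
    /* Global test: all bands together look like speech. */
    if (sum >= vad->global_threshold) {
        is_speech = 1;
    }

    if (is_speech) {
        if (vad->speech_run < LONG_SPEECH_FRAMES) {
            vad->speech_run++;
        }
        /* Re-arm the hangover; a long run of speech earns the longer one. */
        vad->overhang = (vad->speech_run >= LONG_SPEECH_FRAMES)
                            ? vad->overhang_max_2
                            : vad->overhang_max_1;
        return 1;
    }

    /* No speech evidence: keep reporting speech until the hangover runs out. */
    vad->speech_run = 0;
    if (vad->overhang > 0) {
        vad->overhang--;
        return 1;
    }
    return 0;
}
```

Under this reading, the thresholds are statistical (how much evidence is required), while the overhang is a timing rule that keeps the decision at "speech" for a few frames after the evidence disappears, so that word endings are not clipped. Mode 3 would then both require more evidence and let go of speech sooner than mode 0. I would like to know whether this matches what the real code does.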