
Based on the answer given in this topic, I'm trying to implement a way to split the microphone input from pyaudio using librosa. But since I've never worked with audio, I'm struggling to understand the best approach.

As anticipated, I started from the code in that answer, just replacing

librosa.feature.mfcc(numpy_array)

with

print(librosa.effects.split(numpy_array))

just to see whether the audio was being captured correctly, but all I get is lots of [0 2048]. What's the best way to "split" an audio input stream when silence is found? My goal is to create a list of portions of the audio to process.
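To illustrate what I mean, here's a minimal self-contained sketch (not my actual capture code, just a synthetic tone/silence signal) of what I'm seeing versus what I'm after:

import numpy as np
import librosa

sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of tone
signal = np.concatenate([tone, np.zeros(sr), tone])        # tone, 1 s of silence, tone

# What I'm effectively doing now: splitting each small chunk on its own.
# Any chunk that contains sound just comes back as one interval covering
# the whole chunk, hence the repeated [0 2048].
print(librosa.effects.split(signal[:2048], top_db=30))

# What I'd like: split an accumulated buffer into its non-silent portions.
intervals = librosa.effects.split(signal, top_db=30)
portions = [signal[start:end] for start, end in intervals]
print(intervals)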

Manuel Celli
  • What does "split" an audio input stream mean? That you wish to ignore the silences and trigger some further real-time processing on the non-silent parts? What kind of processing comes next? And what is the typical input audio? These things typically influence the appropriate hyperparameters for silence/activity detection – Jon Nordby Apr 30 '23 at 08:51
  • Hi @JonNordby, thanks for your time. For the first part, yes, my objective is to ignore silences and trigger further real-time processing on the non-silent part. The audio that the microphone is catching is speech, the processing I'm aiming to do is real-time speech to text and then re-process that text. – Manuel Celli May 02 '23 at 07:27
  • I forgot to mention that I'm using vosk for the speech recognition part, so the non-silent audio part should be in a format that can be processed by vosk, sadly I've never worked with audio, so I'm trying to figure things out as I go – Manuel Celli May 02 '23 at 07:42
  • So, you are essentially implementing a Voice Activity Detection system that will then trigger speech-to-text, and then text-processing. So it is voice that you want to actually trigger on - not "not silence" (but having high recall and low latency, such that no speech gets lost) – Jon Nordby May 03 '23 at 09:23
  • Yes @JonNordby, exactly! I apologize for my previous lack of explanation, since this is a first for me. I managed to do that without using librosa, but using webrtcvad. I still have to fine-tune the wait time before we can infer that the utterance is finished, but this should suffice. – Manuel Celli May 03 '23 at 10:40
  • Great that you found and posted a solution :) – Jon Nordby May 03 '23 at 12:41

1 Answer


Answering my own question: I ditched librosa in favor of webrtcvad for speech/non-speech detection, since it has a method that does exactly that. The webrtcvad module sadly has some restrictions on the kind of input it can parse, but it seems to work well enough for my use case.
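For reference (this comes from the py-webrtcvad documentation, it is not specific to my code): it only accepts 16-bit mono PCM audio sampled at 8000, 16000, 32000 or 48000 Hz, and each frame passed to is_speech() must be exactly 10, 20 or 30 ms long. You can sanity-check a rate/frame-length combination like this:

import webrtcvad

# Check which (sample rate, samples-per-frame) combinations webrtcvad accepts.
for rate in (8000, 16000, 32000, 48000):
    for ms in (10, 20, 30):
        frame_length = rate * ms // 1000
        print(rate, ms, webrtcvad.valid_rate_and_frame_length(rate, frame_length))

With those constraints in mind, this is the full draft: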

import json
import time
import pyaudio
import webrtcvad
from queue import Queue
from threading import Thread
from vosk import Model, KaldiRecognizer

# Audio settings
FRAME_RATE = 16000              # sample rate in Hz (a rate webrtcvad supports)
CHUNK_SIZE = 160                # samples per chunk: 10 ms at 16 kHz, a valid VAD frame length
AUDIO_FORMAT = pyaudio.paInt16  # 16-bit PCM, the format webrtcvad expects
CHANNELS = 1                    # mono
SILENCE_LIMIT = 4               # seconds of silence after speech before the utterance is considered finished

# Speech recognition settings
# Initialize WebRTC VAD
vad = webrtcvad.Vad()
# Most aggressive VAD mode (0 = least, 3 = most aggressive)
vad.set_mode(3)
model = Model(model_name="vosk-model-small-en-us-0.22")
recognizer = KaldiRecognizer(model, FRAME_RATE)
recognizer.SetWords(True)

# Queues
messages = Queue()    # non-empty while the recording/recognition threads should keep running
recordings = Queue()  # raw audio chunks handed from the recording thread to the recognizer

def record_microphone():
    p = pyaudio.PyAudio()

    stream = p.open(format=AUDIO_FORMAT,
                    channels=CHANNELS,
                    rate=FRAME_RATE,
                    input=True,
                    frames_per_buffer=CHUNK_SIZE)

    while not messages.empty():
        recordings.put(stream.read(CHUNK_SIZE))

    stream.stop_stream()
    stream.close()
    p.terminate()
    
def speech_recognition():
    buffer = b""
    in_speech = False
    silence_threshold = 0
    
    while not messages.empty():
        if not recordings.empty():
            frames = recordings.get()
            assert webrtcvad.valid_rate_and_frame_length(FRAME_RATE, CHUNK_SIZE)
            is_speech = vad.is_speech(frames, sample_rate=FRAME_RATE)

            if is_speech:
                if not in_speech:
                    # if speech is detected but script not aware of speech
                    # make it aware
                    in_speech = True
                # put 10ms of the audio (160 frames) in the buffer
                buffer += frames
                silence_threshold = 0
            elif not is_speech and in_speech:
                # if no speech is detected but the script was expecting speech,
                # check whether the accumulated silence is still under SILENCE_LIMIT seconds;
                # if so, increase the silence counter by one chunk (10 ms),
                # otherwise the user has stopped speaking, so process the
                # buffered waveform and reset the state
                if silence_threshold < SILENCE_LIMIT * (FRAME_RATE / CHUNK_SIZE):
                    silence_threshold += 1
                else:
                    recognizer.AcceptWaveform(buffer)
                    print(json.loads(recognizer.Result())["text"])
                    in_speech = False
                    silence_threshold = 0
                    buffer = b""                
        
def start_recording():
    messages.put(True)

    print("Starting...")
    record = Thread(target=record_microphone)
    record.start()
    transcribe = Thread(target=speech_recognition)
    transcribe.start()
    print("Listening.")

def stop_recording():
    messages.get()
    print("Stopped.")
    
if __name__ == "__main__":
    start_recording()
    time.sleep(35)
    stop_recording()

This being the first time I've done something of this sort, the code can (and probably will) be optimized, but I'm leaving it here as a draft for whoever needs it in the future.
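One specific rough edge: whatever is still sitting in buffer when the loop in speech_recognition() exits is dropped. An untested sketch of a final flush, placed right after the while not messages.empty(): loop, could be:

    # Untested sketch: flush any buffered speech left over when the loop exits,
    # using vosk's FinalResult() to force the recognizer to finalize.
    if buffer:
        recognizer.AcceptWaveform(buffer)
        result = json.loads(recognizer.FinalResult())
        if result.get("text"):
            print(result["text"])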

Manuel Celli