19

I've been working with Python speech recognition for the better part of a month now, making a JARVIS-like assistant. I've used both the Speech Recognition module with Google Speech API and Pocketsphinx, and I've used Pocketsphinx directly without another module. While the recognition is accurate, I've had a hard time working with the large amount of time these packages take to process speech. The way they seem to work is by recording from one point of silence to another, and then passing the recording to the STT engine. While the recording is being processed, no other sound can be recorded for recognition, which can be a problem if I'm trying to issue multiple complex commands in series.

When looking at the Google Assistant voice recognition, Alexa's voice recognition, or Mac OS High Sierra's offline recognition, I see words being recognized as I say them without any pause in the recording. I've seen this called realtime recognition, streaming recognition, and word-by-word recognition. Is there any way to do this in Python, preferably offline without using a client?

I tried (unsuccessfully) to accomplish this by changing pause threshold, speaking threshold, and non-speaking threshold for the SpeechRecognition recognizer, but that just caused the audio to segment strangely and still needed a second after each recognition before it could record again.

Elias N-d
  • 199
  • 1
  • 1
  • 4

2 Answers

10

First of all, there is a Python library called VOSK. To install it on your computer, run this command:

pip3 install vosk

For more details, please visit:

https://alphacephei.com/vosk/install

Now we have to download a model. Go to this website, choose your preferred model, and download it:

https://alphacephei.com/vosk/models

Here I use "vosk-model-small-en-us-0.15" as my model.

After downloading, you will see it is a compressed file. Unzip it in your project root folder, like this:

speech-recognition/
    ├─ vosk-model-small-en-us-0.15 (unzipped folder)
    ├─ offline-speech-recognition.py (Python file)

Here is the full code:

    import json

    import pyaudio
    from vosk import Model, KaldiRecognizer

    # Path to the unzipped model folder
    model = Model(r"C:\Users\User\Desktop\python practice\ai\vosk-model-small-en-us-0.15")
    recognizer = KaldiRecognizer(model, 16000)

    # Open the microphone as a 16 kHz, 16-bit mono stream
    mic = pyaudio.PyAudio()
    stream = mic.open(format=pyaudio.paInt16, channels=1, rate=16000,
                      input=True, frames_per_buffer=8192)
    stream.start_stream()

    while True:
        data = stream.read(4096)
        if recognizer.AcceptWaveform(data):
            # Result() returns a JSON string such as {"text": "hello world"}
            result = json.loads(recognizer.Result())
            print(f"' {result['text']} '")
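A note on extracting the recognized text: `Result()` (and `PartialResult()`, which Vosk exposes for word-by-word hypotheses while you are still speaking) return JSON strings, so parsing them with the standard `json` module is more robust than slicing fixed character offsets. A minimal sketch; the sample strings below are illustrative examples of the documented shape, not output captured from a live microphone run:

```python
import json

# Example strings in the shape Vosk's KaldiRecognizer returns
# (illustrative samples, not output from a live microphone run)
final_json = '{"text": "turn on the lights"}'
partial_json = '{"partial": "turn on the"}'

# Final result, as returned after AcceptWaveform() reports True
final_text = json.loads(final_json)["text"]
print(final_text)    # turn on the lights

# Partial hypothesis, as returned by recognizer.PartialResult()
# while speech is still in progress - this is what gives the
# word-by-word behaviour the question asks about
partial_text = json.loads(partial_json)["partial"]
print(partial_text)  # turn on the
```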

For more detail, you can read this article I've written: https://buddhi-ashen-dev.vercel.app/posts/offline-speech-recognition

Buddhi ashen
  • 101
  • 1
  • 6
9

Pocketsphinx can process streams; see here:

Python pocketsphinx recognition from the microphone

Kaldi can process streams too (it is more accurate than Pocketsphinx):

https://github.com/alphacep/kaldi-websocket-python/blob/master/test_local.py

The Google Speech API can also process streams; see here:

Google Streaming Speech Recognition on an Audio Stream Python
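All three streaming APIs share the same basic pattern: instead of recording from one silence to the next and submitting a whole utterance, you feed the recognizer small fixed-size chunks of audio as they arrive, so recognition runs while recording continues. A minimal, library-independent sketch of that chunking step (the chunk size and the audio buffer below are made-up values for illustration; in practice the bytes come from a live microphone stream):

```python
from typing import Iterator

def chunk_audio(audio: bytes, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield fixed-size chunks of raw audio, in the order a
    streaming recognizer would consume them."""
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]

# Simulated raw audio buffer (in practice this comes from the mic)
fake_audio = bytes(10000)
chunks = list(chunk_audio(fake_audio))
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```

Each chunk would then be passed to the engine's stream-processing call (for example, Vosk's `AcceptWaveform` shown in the other answer), so partial results become available before the speaker has finished.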

Nikolay Shmyrev
  • 24,897
  • 5
  • 43
  • 87
  • The Kaldi link is broken. Do you know where the project exists now, if it still does? – Sylvester Kruin Oct 17 '21 at 18:48
  • Sylvester, I don't know if you are still here, but I found the updated link: https://github.com/kaldi-asr/kaldi, so anyone who comes here can go to the updated link – Pear Mar 23 '22 at 14:22