Using a script very similar to test_ffmpeg.py from the Vosk repository, I am exploring what text information I can extract from an audio file.
Here is the full script I'm using:
#!/usr/bin/env python3

from vosk import Model, KaldiRecognizer, SetLogLevel
import sys
import os
import subprocess
import json

SetLogLevel(0)

if not os.path.exists("model"):
    print("Please download the model from https://alphacephei.com/vosk/models and unpack as 'model' in the current folder.")
    exit(1)

sample_rate = 16000
model = Model("model")
rec = KaldiRecognizer(model, sample_rate)

# Decode the input file to 16 kHz mono 16-bit PCM and pipe it to stdout.
process = subprocess.Popen(['ffmpeg', '-loglevel', 'quiet', '-i',
                            sys.argv[1],
                            '-ar', str(sample_rate), '-ac', '1',
                            '-f', 's16le', '-'],
                           stdout=subprocess.PIPE)

file = open(sys.argv[1] + ".txt", "w+")

while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # A complete utterance was recognized; write out its text.
        file.write(json.loads(rec.Result())['text'] + "\n\n")
        # print(rec.Result())
    # else:
    #     print(rec.PartialResult())

# FinalResult() flushes whatever audio is still buffered in the recognizer.
file.write(json.loads(rec.FinalResult())['text'])
file.close()
This example works well; however, the only return value I can find from rec.PartialResult() and rec.Result() is a JSON string containing the recognized text. Is there a way to query the KaldiRecognizer for the times at which individual words were found within the audio file?
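For reference, after json.loads() the two payloads I see have this shape (the example transcripts are invented):

# Parsed shapes of the two result types:
result = json.loads(rec.Result())          # e.g. {"text": "the quick brown fox"}
partial = json.loads(rec.PartialResult())  # e.g. {"partial": "the quick bro"}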
As I'm typing this, I'm already thinking that post-processing the results, and correlating changes in the partial result with the number of samples consumed so far, would get me what I want, but I'm posting this here in case word timing is already implemented. A rough sketch of that fallback idea follows.
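This is untested and reuses sample_rate, rec, and process from the script above; it also assumes the partial transcript only grows within an utterance, which the decoder does not actually guarantee, since it can revise earlier words:

bytes_per_second = sample_rate * 2  # s16le mono: 2 bytes per sample
bytes_read = 0
seen_words = 0
timings = []  # (word, approximate time in seconds when it first appeared)

while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        break
    bytes_read += len(data)
    if rec.AcceptWaveform(data):
        seen_words = 0  # utterance finished, the partial transcript resets
    else:
        partial = json.loads(rec.PartialResult())["partial"].split()
        # Words that appeared since the last chunk get stamped with the
        # current stream position, which is at best a crude upper bound
        # on when the word was actually spoken.
        for word in partial[seen_words:]:
            timings.append((word, bytes_read / bytes_per_second))
        seen_words = len(partial)

At best this gives the time the decoder committed to a word, not the time the word started, so I'd much rather use something built in if it exists.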