I'm using Vosk (https://alphacephei.com/vosk/) in Python, and I want to get the start and end times of every word in an audio file. I already have an exact transcript of that audio file.
I'm using some code I found online to perform speech-to-text with Vosk, which also gives the start and end times of every word. Unfortunately, the transcription isn't perfect.
Since I have the correct transcript, I'd like to give it to Vosk and have it tell me the start and end times of every word in that transcript. Is this possible?
Here is the code I'm using now:
import wave
import json
from vosk import Model, KaldiRecognizer
model_path = r".\vosk_models\vosk-model-en-us-0.22"
audio_filename = "some_audio_file.wav"
model = Model(model_path)
wf = wave.open(audio_filename, "rb")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True) # Include the start and end times for each word in the output
# get the list of JSON dictionaries
results = []
# recognize speech using vosk model
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        part_result = json.loads(rec.Result())
        results.append(part_result)
part_result = json.loads(rec.FinalResult())
results.append(part_result)
wf.close() # close audiofile
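For context, with SetWords(True) each entry in results carries a "result" list of dictionaries with "word", "start", "end" and "conf" keys (entries with no recognized speech have no "result" key). Below is a minimal sketch of how those timings can be flattened, plus one possible stopgap: copying timings onto the known transcript wherever the recognized words match, using Python's difflib. This is not forced alignment and not a Vosk feature. The helper names flatten_words and transfer_timings and the sample data are hypothetical, made up for illustration.

```python
import difflib


def flatten_words(results):
    """Collect the per-word entries from a list of Vosk result dicts."""
    words = []
    for res in results:
        # Results with no recognized speech have no "result" key.
        words.extend(res.get("result", []))
    return words


def transfer_timings(recognized, transcript_words):
    """Copy timings from recognized words onto matching transcript words.

    Transcript words with no matching recognized word keep None timings.
    """
    ref_tokens = [w.lower() for w in transcript_words]
    rec_tokens = [w["word"].lower() for w in recognized]
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=rec_tokens, autojunk=False)
    aligned = [{"word": w, "start": None, "end": None} for w in transcript_words]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            src = recognized[block.b + k]
            aligned[block.a + k]["start"] = src["start"]
            aligned[block.a + k]["end"] = src["end"]
    return aligned


# Hypothetical sample shaped like Vosk output with SetWords(True);
# "quack" stands in for a misrecognized word.
sample_results = [
    {"result": [
        {"conf": 1.0, "start": 0.10, "end": 0.40, "word": "the"},
        {"conf": 0.6, "start": 0.40, "end": 0.80, "word": "quack"},
        {"conf": 1.0, "start": 0.80, "end": 1.20, "word": "brown"},
    ], "text": "the quack brown"},
    {"text": ""},
    {"result": [
        {"conf": 1.0, "start": 1.50, "end": 1.90, "word": "fox"},
    ], "text": "fox"},
]

recognized = flatten_words(sample_results)
aligned = transfer_timings(recognized, ["The", "quick", "brown", "fox"])
for entry in aligned:
    print(entry)
```

In this sketch the transcript is assumed to be already split into words with punctuation stripped. Words that were misrecognized are left with None timings, which is exactly the gap that aligning directly against the correct transcript would close.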