How can I get the start and end times of words in an audio file with a known transcript using Vosk?

Question

I'm using Vosk (https://alphacephei.com/vosk/) in Python. I have an audio file and its transcript, and I want to get the start and end times of every word in the audio.

I'm using some code I found online to perform speech-to-text using Vosk, and it also gives the start and end times of every word. Unfortunately the transcription isn't perfect.

Since I have the perfect transcript, I want to tell Vosk what the correct transcript is and have it tell me the start and end times of every word. Is this possible?

Here is the code I'm using now:

import wave
import json

from vosk import Model, KaldiRecognizer

model_path = r".\vosk_models\vosk-model-en-us-0.22"
audio_filename = "some_audio_file.wav"

model = Model(model_path)
wf = wave.open(audio_filename, "rb")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # Include the start and end times for each word in the output

# get the list of JSON dictionaries
results = []
# recognize speech using vosk model
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        part_result = json.loads(rec.Result())
        results.append(part_result)
part_result = json.loads(rec.FinalResult())
results.append(part_result)

wf.close()  # close audiofile

score 0 · Answer 1 · answered Jan 14 '23 at 16:58

Perhaps you could make use of sttcast. It uses Vosk to transcribe audio to an HTML file, from which you can collect the timestamps and the text to correct. I think the task could be automated if you have hundreds of hours of audio, but for only a few hours you should consider doing it manually.
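If you do want to automate the correction step, here is a rough sketch of one possible approach. It is not something sttcast or Vosk provides, and it is not true forced alignment: it only aligns the known transcript against the words Vosk recognized, using Python's difflib, and copies the timestamps across wherever the words match. It assumes `results` has the shape produced by the code in the question with SetWords(True), i.e. each entry may carry a "result" list of dicts with "word", "start" and "end" keys. The function names (flatten_words, normalize, align_transcript) and the "exact" flag are made up for this example. Where Vosk misheard a stretch of words, the time span of that stretch is simply divided evenly among the transcript words, which is only an estimate.

```python
from difflib import SequenceMatcher


def flatten_words(results):
    """Collect the per-word dicts from a list of Vosk result dicts."""
    words = []
    for res in results:
        # Results for silent chunks may have no "result" key at all.
        words.extend(res.get("result", []))
    return words


def normalize(token):
    """Lowercase a token and drop punctuation so 'Fox.' compares equal to 'fox'."""
    return "".join(ch for ch in token.lower() if ch.isalnum() or ch == "'")


def align_transcript(transcript, recognized):
    """Attach timestamps from recognized words to the words of a known transcript.

    Returns one dict per transcript word with "word", "start", "end" and
    "exact". "exact" is True when the timing was copied from a matching
    recognized word and False when it was estimated or is missing (None).
    """
    ref_tokens = transcript.split()
    ref_norm = [normalize(t) for t in ref_tokens]
    hyp_norm = [normalize(w["word"]) for w in recognized]

    aligned = [
        {"word": t, "start": None, "end": None, "exact": False}
        for t in ref_tokens
    ]

    matcher = SequenceMatcher(a=ref_norm, b=hyp_norm, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same words on both sides: copy the timings one-to-one.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                aligned[i]["start"] = recognized[j]["start"]
                aligned[i]["end"] = recognized[j]["end"]
                aligned[i]["exact"] = True
        elif tag == "replace":
            # Misrecognized stretch: split its time span evenly (an estimate).
            span_start = recognized[j1]["start"]
            span_end = recognized[j2 - 1]["end"]
            step = (span_end - span_start) / (i2 - i1)
            for k, i in enumerate(range(i1, i2)):
                aligned[i]["start"] = span_start + k * step
                aligned[i]["end"] = span_start + (k + 1) * step
        # "delete": transcript words Vosk did not hear keep start/end = None.
        # "insert": extra recognized words are ignored.
    return aligned
```

With the `results` list from the question, usage would look like `align_transcript(open("transcript.txt").read(), flatten_words(results))`, where transcript.txt is a placeholder for wherever your transcript lives. Words that come back with "exact" set to False are the ones worth checking by hand.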
