
I use OpenAI's Whisper Python library for speech recognition. How can I get word-level timestamps?


To transcribe with OpenAI's Whisper (tested on Ubuntu 20.04 x64 LTS with an Nvidia GeForce RTX 3090):

conda create -y --name whisperpy39 python==3.9
conda activate whisperpy39
pip install git+https://github.com/openai/whisper.git 
sudo apt update && sudo apt install ffmpeg
whisper recording.wav
whisper recording.wav --model large

If using an Nvidia GeForce RTX 3090, add the following after conda activate whisperpy39:

pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch
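
For reference, the CLI calls above correspond roughly to the following Python usage of the library (whisper.load_model and model.transcribe are the package's entry points; note that the returned timestamps are per segment, not per word):

import whisper

model = whisper.load_model("large")            # or "base", "small", ...
result = model.transcribe("recording.wav")

print(result["text"])                          # full transcript
for segment in result["segments"]:             # segment-level timestamps only
    print(segment["start"], segment["end"], segment["text"])
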
Franck Dernoncourt

3 Answers


I created a repo to recover word-level timestamps (and confidence scores), as well as more accurate segment timestamps: https://github.com/Jeronymous/whisper-timestamped

It is built on the cross-attention weights of Whisper, as in this notebook in the Whisper repo. I tuned the approach a bit to get better word locations, and added the ability to compute the cross-attention on the fly, so there is no need to run the Whisper model twice. There are no memory issues when processing long audio.

Note: I first tried the approach of using a wav2vec model to realign Whisper's transcribed words with the input audio. It works reasonably well, but it has many drawbacks: it requires handling a separate (wav2vec) model, running another inference pass over the full signal, having one wav2vec model per language, and normalizing the transcribed text so that its character set matches the wav2vec model's (e.g. converting numbers to words, and handling symbols like "%", currencies, ...). The alignment can also have trouble with disfluencies that Whisper usually removes (so part of what the wav2vec model would recognize is missing, such as the beginnings of sentences that are reformulated).
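
A minimal usage sketch, based on the whisper-timestamped README (the whisper_timestamped package name and the whisper.transcribe(model, audio, ...) call are taken from that README and may change between versions; "recording.wav" is the example file from the question):

import json
import whisper_timestamped as whisper

audio = whisper.load_audio("recording.wav")
model = whisper.load_model("small", device="cuda")   # any Whisper model size
result = whisper.transcribe(model, audio, language="en")

# Each segment carries a "words" list with per-word start/end times and a confidence score.
for segment in result["segments"]:
    for word in segment["words"]:
        print(word["text"], word["start"], word["end"], word["confidence"])

# Or dump the full structure:
print(json.dumps(result, indent=2, ensure_ascii=False))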

Jeronymous
  • Btw this lib is a thing of beauty. In my case, I already have a transcript of the audio and only need timestamps. Is there a way to feed in the transcript to improve accuracy? For example, by using the initial_prompt option? – Mallory-Erik Mar 22 '23 at 00:55
  • @Mallory-Erik I haven't looked at the Jeronymous repository but if you look at cell 19 in the latest Whisper notebook (Multilingual_ASR.ipynb), the line starting "tokens = torch.tensor(..." uses the transcribed tokens to generate the model input. You can replace tokens with your own transcript to achieve what you want. – Nick Fisher Apr 05 '23 at 01:08
  • wav2vec models are also quite good at aligning a transcript with an audio signal, as they assign a probability to each character for each frame of audio. You can have a look at https://pytorch.org/audio/stable/tutorials/forced_alignment_tutorial.html. You just have to make sure to find a wav2vec model for the language you want to process, and that the character set of this model is consistent with your transcript (you might need to normalize your transcript, with things like lowercasing, "num2words" conversion, punctuation removal, ...). – Jeronymous Apr 06 '23 at 22:04

https://openai.com/blog/whisper/ only mentions "phrase-level timestamps", so I infer that word-level timestamps are not obtainable without adding more code.

From one of the Whisper authors:

Getting word-level timestamps are not directly supported, but it could be possible using the predicted distribution over the timestamp tokens or the cross-attention weights.

https://github.com/jianfch/stable-ts (MIT License):

This script modifies methods of Whisper's model to gain access to the predicted timestamp tokens of each word without needing additional inference. It also stabilizes the timestamps down to the word level to ensure chronology.
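
A minimal sketch, assuming a recent stable-ts release (stable_whisper.load_model, model.transcribe and the result export helpers are taken from its README; the API has changed across versions, so check the repo for the exact calls):

import stable_whisper

model = stable_whisper.load_model("base")      # wraps whisper.load_model
result = model.transcribe("recording.wav")     # same call pattern as Whisper

# Export word-level timestamps; the available output helpers depend on the installed version.
result.to_srt_vtt("recording.srt", word_level=True)
result.save_as_json("recording.json")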

Another option is to use a word-level forced-alignment program. E.g., Lhotse (Apache-2.0 license) has integrated both Whisper ASR and wav2vec forced alignment.

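A rough sketch of that Lhotse workflow (the workflow subcommand names below are assumptions based on Lhotse's documentation and may differ in your version; check lhotse workflows --help):

# Hedged sketch: subcommand names are assumptions from Lhotse's docs.
pip install lhotse

# Transcribe recordings with Whisper (segment-level supervisions) ...
lhotse workflows annotate-with-whisper --help

# ... then refine the timestamps with wav2vec2 forced alignment.
lhotse workflows align-with-torchaudio --help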

Franck Dernoncourt

One can use the Python package whisperX (https://github.com/m-bain/whisperX), which refines Whisper's output with wav2vec2 forced alignment to produce word-level timestamps.
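
A minimal usage sketch, following the whisperX README (whisperx.load_model, whisperx.load_align_model and whisperx.align are taken from that README; names and arguments may change between releases):

import whisperx

device = "cuda"                                   # or "cpu"
audio = whisperx.load_audio("recording.wav")

# 1. Transcribe with Whisper (segment-level timestamps).
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. Re-align with a wav2vec2 model to get word-level timestamps.
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in aligned["segments"]:
    for word in segment.get("words", []):
        if "start" in word:                        # some tokens may not get aligned
            print(f"{word['start']:.2f}-{word['end']:.2f}  {word['word']}")
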

Franck Dernoncourt