OpenAI's Whisper delivers nice, clean transcripts. Now I would like it to produce rawer transcripts that also include filler words (ah, mh, mhm, uh, oh, etc.). The post here tells me that this is possible by setting normalization to false: https://huggingface.co/spaces/openai/whisper/discussions/30
I managed to use this code, but Whisper only transcribes the first 30 seconds. How can I make it process longer audio files?
Please note that I'm a total beginner with whisper as well as python.
What I've done so far: I mainly use the code from https://huggingface.co/spaces/openai/whisper/discussions/30 Since I don't want to use a dummy dataset, I load my local mp3 with librosa. I guess there are also other ways to do this, and I'm open to them.
As I understand it, the Whisper processor has to be instructed to deactivate normalization. That's why I'm not using the whisper package directly (import whisper) but Whisper via transformers. The relevant switch is normalize=False.
My code (myscript.py):
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the local mp3 at 16 kHz mono, the sampling rate Whisper expects
speech, _ = librosa.load("myaudio.mp3", sr=16000, mono=True)

processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Force German transcription
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")

# Convert the waveform to input features and transcribe
input_features = processor(speech, return_tensors="pt", sampling_rate=16000).input_features
predicted_ids = model.generate(input_features)

# Decode without normalization so filler words are kept
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)
print(transcription)
This works fine so far. However, only the first 30 seconds are transcribed.
Transcription of longer audio files should be possible with pipeline, as mentioned here:
The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers pipeline method.
According to that page, the code for this is:
import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    chunk_length_s=30,
    device=device,
)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

prediction = pipe(sample.copy())["text"]

# we can also return timestamps for the predictions
prediction = pipe(sample, return_timestamps=True)["chunks"]
Now I'm having trouble combining my code with this code. How do I load my local mp3 instead of the dataset? (librosa doesn't seem to work here.) Where can I set normalization to false?
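To make the question more concrete, here is a rough, untested sketch of how I imagine the combination could look. The dict input format ({"raw": ..., "sampling_rate": ...}) and the forced_decoder_ids line on the pipeline's model are assumptions on my part, pieced together from the two snippets above:

import torch
import librosa
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large",
    chunk_length_s=30,
    device=device,
)

# My assumption: instead of the dummy dataset, feed the librosa waveform as a
# dict with the raw audio and its sampling rate (or maybe just the file path?)
speech, _ = librosa.load("myaudio.mp3", sr=16000, mono=True)

# Same language/task setting as in my first script -- not sure if this is the
# right way to do it with a pipeline
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language="de", task="transcribe"
)

prediction = pipe({"raw": speech, "sampling_rate": 16000})["text"]
# ^ But where does normalize=False go in this setup?
print(prediction)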