
I want to open an mp3 using AudioSegment, convert the AudioSegment object to a numpy array, and use that array as input for a Whisper model. I followed this question, How to create a numpy array from a pydub AudioSegment?, but none of the answers helped, since I always get an error like:

Traceback (most recent call last):
  File "E:\Programmi\PythonProjects\whisper_real_time\test\converting_test.py", line 19, in <module>
    result = audio_model.transcribe(arr_copy, language="en", word_timestamps=True,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\transcribe.py", line 121, in transcribe
    mel = log_mel_spectrogram(audio, padding=N_SAMPLES)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\audio.py", line 146, in log_mel_spectrogram
    audio = F.pad(audio, (0, padding))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 86261939712 bytes.

This error is strange, because if I provide the file path directly, as below, I get no problems:

result = audio_model.transcribe("../audio_test_files/1001_IEO_DIS_HI.mp3", language="en", word_timestamps=True,
                                        fp16=torch.cuda.is_available())

This is the code I wrote:

from pydub import AudioSegment
import numpy as np
import whisper
import torch


audio = AudioSegment.from_mp3("../audio_test_files/1001_IEO_DIS_HI.mp3")

dtype = getattr(np, "int{:d}".format(
    audio.sample_width * 8))  # Or could create a mapping: {1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64}
arr = np.ndarray((int(audio.frame_count()), audio.channels), buffer=audio.raw_data, dtype=dtype)
arr_copy = arr.copy()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper...")
audio_model = whisper.load_model("small", download_root="../models",
                                     device=device)
print(f"Transcribing...")
result = audio_model.transcribe(audio=arr_copy, language="en", word_timestamps=True,
                                        fp16=torch.cuda.is_available())  # , initial_prompt=result.get('text', ""))
text = result['text'].strip()
print(text)

How can I do this?

--------EDIT-------- I edited the code and now use the version below. I no longer get the error, but the model doesn't seem to transcribe correctly. To check what audio I was passing to the model, I exported it back to a wav file and played it: there is a lot of noise and I can't understand what is being said, so that's why the model does not transcribe. Are the normalization steps I am doing ok?

from pydub import AudioSegment
import numpy as np
import whisper
import torch

language = "en"
model = "medium"
model_path = "../models"

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper {model} model {language}...")
audio_model = whisper.load_model(model, download_root=model_path, device=device)

# load wav file with pydub
audio_path = "20230611-004146_audio_chunk.wav"
audio_segment = AudioSegment.from_wav(audio_path)
#audio_segment = audio_segment.low_pass_filter(1000)
# get sample rate
sample_rate = audio_segment.frame_rate
arr = np.array(audio_segment.get_array_of_samples())
arr_copy = arr.copy()
arr_copy = torch.from_numpy(arr_copy)
arr_copy = arr_copy.to(torch.float32)
# normalize
arr_copy = arr_copy / 32768.0
# to device
arr_copy = arr_copy.to(device)


print(f"Transcribing...")
result = audio_model.transcribe(arr_copy, language=language, fp16=torch.cuda.is_available())
text = result['text'].strip()
print(text)

waveform = arr_copy.cpu().numpy()
audio_segment = AudioSegment(
    waveform.tobytes(),
    frame_rate=sample_rate,
    sample_width=waveform.dtype.itemsize,
    channels=1
)
audio_segment.export("test.wav", format="wav")
JayJona
  • What `dtype` does your code end up choosing? What is the shape of the numpy array that it is trying to create? – jared Jun 10 '23 at 21:29
  • That is what I'm trying to find out about the Whisper model; I don't know what shape the model wants as input. @jared – JayJona Jun 10 '23 at 21:34
  • Which line is causing the error, the `arr = ...` line? – jared Jun 10 '23 at 21:51
  • @jared yes, sorry, I updated the question with the full traceback. I also added a copy to the code to resolve a warning. – JayJona Jun 10 '23 at 21:57
  • If providing the file directly works, why are you trying to put it into an array and pass that? – jared Jun 10 '23 at 22:02
  • Because in my program I need to process audio in real time, using chunks of audio converted to AudioSegment. I provided this example with an mp3 to show that I can't convert from an AudioSegment object to a numpy array. @jared – JayJona Jun 10 '23 at 22:06

1 Answer


If I remember right, Whisper internally operates on 16 kHz mono audio in segments of 30 seconds. The conversion to the correct format, the splitting, and the padding are all handled by the transcribe function. This is why it works correctly when you supply the MP3 path.
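
For reference, this is roughly what transcribe relies on when you pass a path; a minimal sketch using openai-whisper's own helpers, if I recall the API correctly (whisper.load_audio decodes through ffmpeg and returns a float32 mono array resampled to 16 kHz, with values already scaled to [-1, 1]):

import whisper

# decode and resample to 16 kHz mono float32 via ffmpeg
audio = whisper.load_audio("../audio_test_files/1001_IEO_DIS_HI.mp3")
print(audio.dtype, audio.shape)  # float32, one-dimensional

# cut or pad the array to the 30-second window the model works on
clip = whisper.pad_or_trim(audio)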

If you want to supply a numpy array, you need to do the format and sample-rate conversion yourself. I suggest you start by creating a short (say 10 sec) audio clip in WAV PCM format. Loading it should give you an int16 array of 160000 samples (10 sec * 16 kHz = 160000). Convert the values to float32 and normalize by dividing by 32768.0. The result should be accepted by Whisper.

from pydub import AudioSegment
import numpy as np

audio_segment = AudioSegment.from_mp3(audio_path)

# convert to expected format
if audio_segment.frame_rate != 16000: # 16 kHz
    audio_segment = audio_segment.set_frame_rate(16000)
if audio_segment.sample_width != 2:   # int16
    audio_segment = audio_segment.set_sample_width(2)
if audio_segment.channels != 1:       # mono
    audio_segment = audio_segment.set_channels(1)        
arr = np.array(audio_segment.get_array_of_samples())
arr = arr.astype(np.float32)/32768.0

# audio_model, language, and torch are as defined earlier in your code
result = audio_model.transcribe(arr, language=language, fp16=torch.cuda.is_available())
print(result['text'])

If your original audio is noisy, it is hard to expect good transcription results.
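
A side note on checking the audio by exporting it back, as in your edit: building an AudioSegment directly from the float32 bytes (sample_width=4) makes pydub interpret them as int32 PCM, which is likely why the exported file sounds like noise. A minimal sketch of a listening check, converting back to int16 first (reusing arr and the 16 kHz rate from the snippet above):

# scale the normalized floats back to int16 PCM before exporting
int16_samples = (arr * 32768.0).clip(-32768, 32767).astype(np.int16)
check_segment = AudioSegment(
    int16_samples.tobytes(),
    frame_rate=16000,
    sample_width=2,  # bytes per int16 sample
    channels=1,
)
check_segment.export("check.wav", format="wav")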

igrinis
  • Hi, I edited the code as you said. I don't get the error anymore, but the model can't transcribe correctly because the audio I obtain is very noisy. I tested it by exporting the audio back to wav format. Did I normalize the audio in the correct way? Can you provide an example? – JayJona Jun 14 '23 at 22:26
  • Thank you! It works. – JayJona Jun 15 '23 at 23:00