I want to open an mp3 with pydub's AudioSegment, convert the AudioSegment object to a numpy array, and use that numpy array as input for a Whisper model. I followed this question, How to create a numpy array from a pydub AudioSegment?, but none of the answers helped, because I always get an error like:
Traceback (most recent call last):
File "E:\Programmi\PythonProjects\whisper_real_time\test\converting_test.py", line 19, in <module>
result = audio_model.transcribe(arr_copy, language="en", word_timestamps=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\transcribe.py", line 121, in transcribe
mel = log_mel_spectrogram(audio, padding=N_SAMPLES)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\Programmi\PythonProjects\whisper_real_time\venv\Lib\site-packages\whisper\audio.py", line 146, in log_mel_spectrogram
audio = F.pad(audio, (0, padding))
^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 86261939712 bytes.
This error is strange, because if I pass the file path directly, like below, I get no problems:
result = audio_model.transcribe("../audio_test_files/1001_IEO_DIS_HI.mp3", language="en", word_timestamps=True,
fp16=torch.cuda.is_available())
This is the code I wrote:
from pydub import AudioSegment
import numpy as np
import whisper
import torch
audio = AudioSegment.from_mp3("../audio_test_files/1001_IEO_DIS_HI.mp3")
# Pick the numpy dtype matching the sample width
dtype = getattr(np, "int{:d}".format(audio.sample_width * 8))  # Or could create a mapping: {1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64}
arr = np.ndarray((int(audio.frame_count()), audio.channels), buffer=audio.raw_data, dtype=dtype)
arr_copy = arr.copy()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper...")
audio_model = whisper.load_model("small", download_root="../models",
device=device)
print(f"Transcribing...")
result = audio_model.transcribe(audio=arr_copy, language="en", word_timestamps=True,
                                fp16=torch.cuda.is_available())  # , initial_prompt=result.get('text', ""))
text = result['text'].strip()
print(text)
How can I do it?
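For reference, my understanding is that whisper's own load_audio returns a mono float32 waveform resampled to 16 kHz and scaled to [-1.0, 1.0], so I assume the numpy array has to match that format. Below is a rough sketch of the conversion I have in mind with pydub (the resample to 16000 Hz and the division by 32768.0 are my assumptions about the expected format, not something I have verified):

from pydub import AudioSegment
import numpy as np

# Sketch: turn an AudioSegment into the kind of waveform whisper.load_audio produces
# (assumes 16-bit samples; result is mono, 16 kHz, float32 scaled to [-1.0, 1.0])
audio = AudioSegment.from_mp3("../audio_test_files/1001_IEO_DIS_HI.mp3")
audio = audio.set_channels(1).set_frame_rate(16000)
samples = np.array(audio.get_array_of_samples())
waveform = samples.astype(np.float32) / 32768.0  # int16 range -> [-1.0, 1.0]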
--------EDIT--------
I edited the code and now use the version below. I no longer get the error I had before, but the model doesn't seem to transcribe correctly. To check what audio I was passing to the model, I exported it back to a wav file and played it: there is a lot of noise and I can't understand what is being said, so that's why the model doesn't transcribe. Are the normalization steps I am doing correct?
from pydub import AudioSegment
import numpy as np
import whisper
import torch
language = "en"
model = "medium"
model_path = "../models"
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading whisper {model} model {language}...")
audio_model = whisper.load_model(model, download_root=model_path, device=device)
# load wav file with pydub
audio_path = "20230611-004146_audio_chunk.wav"
audio_segment = AudioSegment.from_wav(audio_path)
#audio_segment = audio_segment.low_pass_filter(1000)
# get sample rate
sample_rate = audio_segment.frame_rate
arr = np.array(audio_segment.get_array_of_samples())
arr_copy = arr.copy()
arr_copy = torch.from_numpy(arr_copy)
arr_copy = arr_copy.to(torch.float32)
# normalize
arr_copy = arr_copy / 32768.0
# to device
arr_copy = arr_copy.to(device)
print(f"Transcribing...")
result = audio_model.transcribe(arr_copy, language=language, fp16=torch.cuda.is_available())
text = result['text'].strip()
print(text)
waveform = arr_copy.cpu().numpy()
audio_segment = AudioSegment(
    waveform.tobytes(),
    frame_rate=sample_rate,
    sample_width=waveform.dtype.itemsize,
    channels=1,
)
audio_segment.export("test.wav", format="wav")