
I am trying to use the transcribe method from OpenAI's Whisper Python module without loading the audio from the file system. In my code I have downloaded an OGG audio file from a Matrix server repository and now want to transcribe it. whisper.transcribe only accepts a file path, np.ndarray, or Tensor as input. Simply converting the bytes to an np.ndarray or Tensor fails, as the resulting arrays seem to be missing vital information. I was wondering whether I can use some other code inside the whisper API to achieve this without first writing my bytes to a file and later reading them back.

import whisper

# this gets an ogg file from a matrix server via mxc:// url as bytes
audio = await self.client.download_media(evt.content.url)

model = whisper.load_model("base")
with open("my_file.ogg", "wb") as f:
    f.write(audio)
result = model.transcribe("my_file.ogg")

This code works, but it looks less like fine programming and more like a quick hack: it works, but it's ugly. So I am wondering if there is a better option.

Tupsi
  • "All I found was bytes[] to ndarray but not just bytes." There's no such thing as `bytes[]` in Python; the `bytes` object *is* an array-like thing holding any number of bytes. – ShadowRanger Dec 22 '22 at 20:30

1 Answer


This doesn't seem like a hard problem, given that you alluded to the answer yourself.

A bit of inspection reveals that load_audio() calls ffmpeg.input(file), doing the decoding work in a subprocess. So you need an audio file on disk.

        temp = '/tmp/audio.ogg'
        with open(temp, 'wb') as f:
            # f.write(b'...')  # magic-number header preamble, if one is needed
            f.write(audio)

        result = model.transcribe(temp)

I don't know exactly what you have in audio, so I'm not sure whether some file metadata needs to appear as a preamble there. If you have an OGG-aware library conveniently at hand, use it to help with the file scribbling.


If you have another way of turning your in-memory audio data into a decompressed 16 kHz waveform, sure, go for it. But adhering to the public API exposed by whisper seems like the path of least resistance.
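For completeness, here is one way to do that decompression in memory: pipe the downloaded bytes into an ffmpeg subprocess on stdin and read the raw 16 kHz mono waveform from stdout, mirroring what whisper.load_audio does with a file path. This is only a sketch; `decode_ogg_bytes` is a hypothetical helper, not part of the whisper API, and it assumes the ffmpeg binary is on PATH (which whisper itself already requires):

```python
import subprocess

import numpy as np


def decode_ogg_bytes(data: bytes, sr: int = 16000) -> np.ndarray:
    """Decode compressed audio bytes to a mono float32 waveform at `sr` Hz.

    Reads the input from stdin ("pipe:0") instead of a file, so nothing
    touches the filesystem. The output format matches what whisper's
    transcribe() expects for an ndarray input.
    """
    proc = subprocess.run(
        ["ffmpeg", "-i", "pipe:0",
         "-f", "s16le", "-ac", "1", "-acodec", "pcm_s16le",
         "-ar", str(sr), "pipe:1"],
        input=data, capture_output=True, check=True,
    )
    # ffmpeg emitted raw signed 16-bit little-endian samples; normalize to [-1, 1]
    return np.frombuffer(proc.stdout, np.int16).astype(np.float32) / 32768.0


# result = model.transcribe(decode_ogg_bytes(audio))
```

Whether this is actually "cleaner" than a temp file is debatable, since it spawns the same ffmpeg subprocess either way; it just swaps file I/O for pipe I/O.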

Please post benchmark timings that separate the temp-file I/O delay from the model inference delay. I have trouble believing that I/O will dominate.

If it is significant, and you're persisting to an ext4 directory, consider switching to a tmpfs backend for in-memory performance.
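Independent of the filesystem backend, the temp-file hack itself can be tidied up with the standard library's tempfile module, which handles naming and cleanup automatically. In this sketch, `transcribe_bytes` is a hypothetical wrapper (not part of whisper), and the `/dev/shm` suggestion is a Linux-specific tmpfs assumption:

```python
import tempfile


def transcribe_bytes(model, audio: bytes, tmpdir=None):
    """Write audio bytes to a temporary file, transcribe it, then auto-delete.

    On Linux, pass tmpdir='/dev/shm' to place the file on a tmpfs mount,
    so it lives in RAM and never hits the disk.
    """
    with tempfile.NamedTemporaryFile(suffix=".ogg", dir=tmpdir) as f:
        f.write(audio)
        f.flush()  # ensure ffmpeg's subprocess sees the full contents
        return model.transcribe(f.name)
```

The file is removed as soon as the `with` block exits, so nothing is left behind even if transcribe() raises.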

J_H
  • As I wrote above, I tested what is inside `audio` by writing it into a file. I thought the naming gave it away that it is indeed .ogg, but to clarify: it IS an .ogg file I get from the matrix server. Your answer does not make sense to me, stating that I NEED an audio file, when I specifically said there are two other options the `transcribe` method of whisper accepts: ndarray or Tensor. That is why my question asked "how do I get from bytes to either ndarray or Tensor". – Tupsi Dec 23 '22 at 15:07
  • I wrote the answer prior to your update of the question. It's not enough to have an ndarray / Tensor -- it must be decompressed from OGG format and it must be sampled at the proper sample rate. It was clear from your question that the bytes you had in hand did not satisfy those constraints. Looking at the Whisper source code, the only way its public API offers to get to your end goal is by serializing bytes to the filesystem, running an ffmpeg subprocess, and reading back the result. You voiced a concern about FS overhead. Please append benchmark timings, with / without FS, to the question. – J_H Dec 23 '22 at 15:15
  • I am not worried about benchmarks and timing; it just looked like "dirty coding" to me doing it that way. I messed around a bit with converting it to either a Tensor or a numpy array and you are right, it is not as easy as that, as other errors creep up. So I will leave it the way it works for now. Hopefully the whisper API adds other input sources as well. Granted, what I want to do looks like an edge case, so most likely not. Anyway, thanks for pointing that out. – Tupsi Dec 23 '22 at 16:07
  • changed the question in the hope it clarifies what I am looking for. – Tupsi Dec 23 '22 at 16:18