4

I want to get separate video and audio objects from an ffmpeg stream (Python)

To do this, I run it like this on my Raspberry Pi:

    ffmpeg -f alsa -thread_queue_size 1024 -channels 1 -i hw:2,0 -thread_queue_size 1024 -s 1920x1080 -i /dev/video0 -listen 1 -f matroska -vcodec libx264 -preset veryfast -tune zerolatency http://:8080

From the server side, I connect to the stream like this. I know how to get sound from this packet object, but I don't understand how to get a video frame from it. I would like to get the video stream frame by frame, with the sound separate, for audio and video processing in the program.

    import ffmpeg

    process = (
        ffmpeg.input("http://192.168.1.78:8080").output(
            '-',
            format='matroska',
            acodec='libvorbis',
            vcodec='libx264'
        ).run_async(pipe_stdout=True, pipe_stderr=True)
    )
    while process.poll() is None:
        packet = process.stdout.read(4096)

Using Python 3.9 and ffmpeg-python==0.2.0.

P.S. Essentially I need a numpy array of video and separate audio for each packet.

user5285766
  • Separating the audio and video is *literally* the first example you see on the front page of the [documentation](https://kkroening.github.io/ffmpeg-python/), within the first screen of text, which I found as the first result by putting `ffmpeg-python documentation` [into a search engine](https://duckduckgo.com/?q=ffmpeg-python+documentation). Please try to [research](https://meta.stackoverflow.com/questions/261592) questions before asking. You don't get a frame from a packet of the combined stream; demux *first*. You can search on that page for `Process video frame-by-frame using numpy`, too. – Karl Knechtel Jan 22 '22 at 14:11
  • I completely agree with you that you need to read the documentation before asking, but in my opinion there is not enough information there about what kind of objects (audio, video) you get and how to work with them. I also did not see an example of working with streams; there are examples only with files. That's why I decided to ask here. If you have at least partial answers, then please point me in the right direction) – user5285766 Jan 22 '22 at 17:44
  • I had ranted about how this is indeed explained by the documentation, but I deleted it. I still think you're making this out to be much harder than it actually is, but I'll write the answer anyway so you can see. – Karl Knechtel Jan 23 '22 at 09:30
  • There is additional documentation on the [github page](https://github.com/kkroening/ffmpeg-python) which you might find more useful, too. It has illustrations and everything. – Karl Knechtel Jan 23 '22 at 10:04

3 Answers

5

Essentially I need a numpy array of video and separate audio for each packet.

The difficult part is how to pipe 2 different streams, and your approach depends on your OS.

Linux/macOS

(I'm a Windows guy for the most part so take this with a grain of salt)

  • Use the pass_fds option of subprocess.Popen to create a second "stdout". See this link for an example of how to pass an additional pipe via pass_fds.
  • On the FFmpeg command line, use 'pipe:3' to make FFmpeg write the 2nd output stream to the extra pipe. For example:
ffmpeg -i input_url -f rawvideo -pix_fmt rgb24 - \
                    -f s16le                   pipe:3
  • In Python, dispatch 2 threads to read both pipes simultaneously without deadlock (see the sketch after this list).

(you may need to specify the codecs and more options, but you get the idea.)
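
A minimal sketch of that plumbing, assuming a 1920x1080 rgb24 video stream; input_url and the processing steps are placeholders, and FFmpeg's automatic stream selection routes the video to the rawvideo output and the audio to the s16le output:

import os
import subprocess
import threading

rfd, wfd = os.pipe()  # extra pipe to carry the audio stream

cmd = [
    "ffmpeg", "-i", "input_url",                      # placeholder URL
    "-f", "rawvideo", "-pix_fmt", "rgb24", "pipe:1",  # video -> stdout
    "-f", "s16le", f"pipe:{wfd}",                     # audio -> extra pipe
]
# pass_fds keeps wfd open in the child under the same descriptor number
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, pass_fds=(wfd,))
os.close(wfd)  # the parent only needs the read end

def read_video(pipe):
    while True:
        frame = pipe.read(1920 * 1080 * 3)  # one rgb24 frame (a robust reader loops on short reads)
        if not frame:
            break
        # ... wrap in a numpy array, push to a queue, etc.

def read_audio(fd):
    with os.fdopen(fd, "rb") as pipe:
        while True:
            samples = pipe.read(4096)
            if not samples:
                break
            # ... process the s16le samples

threads = [
    threading.Thread(target=read_video, args=(proc.stdout,)),
    threading.Thread(target=read_audio, args=(rfd,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
proc.wait()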

Windows/cross-platform

  • pass_fds is not available on Windows
  • Use a container that can transport both raw video and audio streams (e.g., AVI):
ffmpeg -i input_url -f avi -c:v rawvideo -pix_fmt rgb24 -c:a pcm_s16le -
  • Python is responsible for demuxing the AVI stream by reading the number of bytes specified by each RIFF chunk header at a time. First decode the header info; then video and audio chunks alternate until the end of the stream (see the sketch after this list). The AVI format is not too complicated; see the Wikipedia entry.
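
To illustrate the demuxing, here is a minimal sketch of a RIFF chunk walker (read_exact and iter_avi_chunks are illustrative helpers, not library functions; real code must also parse the 'hdrl' header chunks to learn the frame size and audio sample format):

import struct

def read_exact(pipe, n):
    # pipes can return short reads, so loop until n bytes arrive (or EOF)
    buf = b""
    while len(buf) < n:
        more = pipe.read(n - len(buf))
        if not more:
            break
        buf += more
    return buf

def iter_avi_chunks(pipe):
    assert read_exact(pipe, 12)[:4] == b"RIFF"  # 'RIFF' <size> 'AVI '
    while True:
        header = read_exact(pipe, 8)            # fourcc + little-endian size
        if len(header) < 8:
            return
        fourcc = header[:4]
        size = struct.unpack("<I", header[4:])[0]
        if fourcc == b"LIST":
            read_exact(pipe, 4)  # list type, e.g. b'movi'; its sub-chunks follow inline
            continue
        yield fourcc, read_exact(pipe, size + (size & 1))  # chunks are word-aligned

In the 'movi' list, chunks tagged b'00dc'/b'00db' carry video from stream 0 and b'01wb' carries audio from stream 1 (the leading digits are the stream index).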

I happen to know all this because I'm currently developing this exact mechanism for my ffmpegio library. Release of this feature is still a little while away, but you can check out my test implementation: tests/test_media.py and src/ffmpegio/utils/avi.py. Please note that the test is written for a file read instead of a stream read, so test_media.py needs to be modified (use Popen instead of run on line 9, and change BytesIO(out.stdout) to out.stdout on line 30).

(addendum) ffmpegio example

I managed to get the AVI streaming mechanism to work with my ffmpegio library, so here are a couple of examples if you are interested in giving it a try.

Your code in the OP suggests packet-wise processing; set blocksize=0 for this:

import ffmpegio

with ffmpegio.open("http://192.168.1.78:8080", 'rva', blocksize=0) as stream:
    for st_spec, chunk in stream:
        # st_spec: stream specifier string: 'v:0', 'a:0', etc.
        # chunk: retrieved AVI chunk data as a numpy array,
        #        e.g., [1 x height x width x 3] for video or [1357 x 1] for mono audio
        your_process(st_spec, chunk)

If you want ffmpegio to gather data per stream and output whenever one of the streams has X number of blocks, set blocksize to a positive int:

with ffmpegio.open("http://192.168.1.78:8080", 'rva', blocksize=1, ref_stream='v:0') as stream:
    for frames in stream:
        # frames: dict of retrieved raw data of the video & audio streams,
        # e.g., {'v:0': [1 x height x width x 3] ndarray, 'a:0': [1357 x 1] ndarray}
        your_process(frames)

You can add (dashless) FFmpeg options to the open argument list as you wish. For input options, append "_in" to the option name.
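
For instance (the option value here is just illustrative; ac_in=1 would map to FFmpeg's input-side '-ac 1' per the naming rule above):

with ffmpegio.open("http://192.168.1.78:8080", 'rva', blocksize=0, ac_in=1) as stream:
    ...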

My quick benchmark on my old laptop suggests that it decodes the video & audio streams at 3x speed, so it should be able to handle a live stream on a modern rig.

Finally, my standard disclaimer. The library is very young, please report any issues on GitHub and I'll address them asap.

kesh
  • In fact, I'm now figuring out what format to finally transfer the data in, so you answered my question. I also read about AVI. To be precise, I don't understand by what rules bytes are converted into frames (a numpy array) for further processing – user5285766 Jan 28 '22 at 20:36
  • I'm using the alsa format for audio because I need to use a microphone. For some reason, other formats cannot read from the microphone. – user5285766 Jan 28 '22 at 21:08
  • "To be precise, I don’t understand by what laws bytes are converted into frames for further processing num array" In general, a raw format has a certain number of bytes per pixel (depending on the detail of colour information; most commonly 3 bytes, so that there is an 8-bit value for each of red, green and blue colour components). The data is simply a sequence of the values for each pixel, in order. The point of AVI is that it is a *container* format; it doesn't specify the encoding of the audio and video streams, but only how to store them both in the same file. – Karl Knechtel Jan 28 '22 at 21:12
  • The idea in this answer is that we are using the `ffmpeg` invocation on the PI to do the decoding, and provide a simpler data format to the Python code. My answer has the Python code do decoding as well as demuxing, by asking ffmpeg via the bindings. It amounts to more or less the same; of course, we will be using ffmpeg from the script *anyway*. – Karl Knechtel Jan 28 '22 at 21:15
  • @KarlKnechtel - OP mentioned reading data from microphone right above, so the input could very well be a stream, in which case invoking FFmpeg twice (which I think is your solution, sorry if I misunderstood) may not be an option. – kesh Jan 28 '22 at 21:23
  • It's unclear to me exactly what you propose will happen on the Python side, in your approach, actually. – Karl Knechtel Jan 29 '22 at 03:57
  • ```ffmpeg -f alsa -thread_queue_size 1024 -channels 1 -i hw:2,0 -thread_queue_size 1024 -s 1920x1080 -i /dev/video0 -listen 1 -f matroska -vcodec libx264 -preset veryfast -tune zerolatency http://:8080``` this is how I am running the client for now. First stream from the mic, second from the cam – user5285766 Jan 29 '22 at 10:42
  • @KarlKnechtel - If FFmpeg pipes raw video & audio frames to Python in AVI container, Python "reader" can be programmed to process the AVI byte data and separate resulting video & audio chunks (the latter containing multiple audio samples) and return a pair of video array [nframes x height x width x 3] and audio array [nsamples x channels]. It is my impression that this is what OP wants to do. – kesh Jan 29 '22 at 13:16
  • @user5285766 - edited my answer with an `ffmpegio` example. If this fits your purpose, please give it a try – kesh Jan 31 '22 at 14:56
3

It's worth understanding up front that these are bindings for FFmpeg, which is doing all the work. It's useful to understand the FFmpeg program itself, in particular the command-line arguments it takes. There is a lot there, but you can learn it a piece at a time according to your actual needs.

Your existing input stream:

process = (
    ffmpeg.input("http://192.168.1.78:8080").output(
        '-',
        format='matroska',
        acodec='libvorbis',
        vcodec='libx264'
    ).run_async(pipe_stdout=True, pipe_stderr=True)
)

Let's compare that to the one in the example partway down the documentation, titled "Process video frame-by-frame using numpy" (I reformatted it a little to match):

process1 = (
    ffmpeg.input(in_filename).output(
        'pipe:',
        format='rawvideo',
        pix_fmt='rgb24'
    ).run_async(pipe_stdout=True)
)

It does not matter whether we use a file or a URL for our input source - ffmpeg.input figures that out for us, and at that point we just have an ffmpeg.Stream either way. (Just like we could use either for a -i argument for the command-line ffmpeg program.)

The next step is to specify how the stream outputs (i.e., what kind of data we will get when we read from the stdout of the process). The documentation's example uses 'pipe:' to specify writing to stdout; this should be the same as '-'. The documentation's example does not pipe_stderr, but that shouldn't matter since we do not plan to read from stderr either way.

The key difference is that we specify a format that we know how to handle. 'rawvideo' means exactly what it sounds like, and is suitable for reading the data into a Numpy array. (This is what we would pass as a -f option at the command line.)

The pix_fmt keyword parameter means what it sounds like: 'rgb24' is 24 bits per pixel, representing red, green and blue components. There are a bunch of pre-defined values for this, which you can see with ffmpeg -pix_fmts. And, yes, you would specify this as -pix_fmt at the command line.

Having created such an input stream, we can read from its stdout and create Numpy arrays from each piece of data. We don't want to read data in arbitrary "packet" sizes for this; we want to read exactly as much data as is needed for one frame. That will be the width of the video, times the height, times three (for RGB components at 1 byte each). Which is exactly what we see later in the example:

import numpy as np

# width and height must be known in advance (e.g., via ffmpeg.probe)
while True:
    in_bytes = process1.stdout.read(width * height * 3)
    if not in_bytes:
        break
    in_frame = (
        np
        .frombuffer(in_bytes, np.uint8)
        .reshape([height, width, 3])
    )

Pretty straightforward: we iteratively read that amount of data, check for the end of the stream, and then create the frame with standard Numpy stuff.

Notice that at no point here did we attempt to separate audio and video - this is because the rawvideo format, as the name implies, won't output any audio data. We don't need to select the video from the input stream in order to filter the audio out. But we can - it's as simple as shown at the top of the documentation: ffmpeg.input(...).video.output(...). Similarly for audio.
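
A minimal sketch of that explicit selection, using the .video and .audio shorthands from the documentation's front page (in_filename is a placeholder):

in_stream = ffmpeg.input(in_filename)
video_only = in_stream.video.output('pipe:', format='rawvideo', pix_fmt='rgb24')
audio_only = in_stream.audio.output('pipe:', format='s16le')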

We can process the audio by creating a separate stream. Choose an appropriate audio format, and specify any other needed arguments. So, perhaps something like:

process2 = (
    ffmpeg.input(in_filename).output(
        'pipe:',
        format='s16le',
        ar=44100  # -ar: audio sample rate
    ).run_async(pipe_stdout=True)
)
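
Reading from it mirrors the video case; a minimal sketch, assuming mono 16-bit audio (pass ac=1 to make the channel count explicit):

chunk_size = 44100 * 2  # one second of mono 16-bit samples
while True:
    in_bytes = process2.stdout.read(chunk_size)
    if not in_bytes:
        break
    samples = np.frombuffer(in_bytes, np.int16)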
Karl Knechtel
  • I haven't actually tested this, but I *have* used ffmpeg a few times at the command line before, and interacted with it via other tools. Thanks, btw, for the reminder that I have a very good use for bindings such as this myself. – Karl Knechtel Jan 23 '22 at 10:00
  • Thank you for your post also). I'll try to reformulate the question a bit, if you don't mind. What I end up needing, in an infinite loop (or while not .poll()), is an audio frame as a numpy array and a video frame (jpeg) as a numpy array. In fact, I want 2 arrays. – user5285766 Jan 29 '22 at 10:53
  • Can you suggest the best way to do this? I saw an example in the documentation, but somehow it didn't make sense to me. It isn't done through tuples there, and I simply did not find how to read the audio and video objects if I run the example with out.run(). https://kkroening.github.io/ffmpeg-python/ I have implemented it through process now. – user5285766 Jan 29 '22 at 11:08
-2

Follow the steps below:

  1. pip install ffmpeg moviepy

  2. import moviepy.editor as mp

  3. my_clip = mp.VideoFileClip(r"videotest.mov")

  4. my_clip.audio.write_audiofile(r"my_result.mp3")

Konstantinos K.
  • Please do not just recommend a different third-party library. That isn't engaging with the problem that was asked about. This site is about the code, after all, not the (e.g.) video being processed. – Karl Knechtel Jan 22 '22 at 14:14