I am developing a Windows application that displays a high-quality video feed, records it or takes photos from it, and lets the user edit them later (up to 4K, maybe 8K in the near future). I currently have a working product in WPF (C#), using the AForge.NET library for capturing and displaying video.
My problem is that the application is really slow, with the main performance hit coming from video rendering. Apparently the only way to do this is to register a callback with the AForge library that provides a new frame every time one is available. That frame is then converted and assigned as the image of a WPF Image element. I believe you can see where the performance hit comes from, especially for high-resolution imagery.
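To make the current pipeline concrete, here is roughly what it looks like (a simplified sketch: `PreviewImage` is an illustrative name for my XAML `Image` element, and I assume 24bpp BGR frames):

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using System.Windows.Media.Imaging;
using AForge.Video;

// Called by AForge on its own capture thread for every new frame.
private void OnNewFrame(object sender, NewFrameEventArgs e)
{
    // e.Frame is a System.Drawing.Bitmap owned by AForge, so clone it
    // before the callback returns.
    using (var frame = (Bitmap)e.Frame.Clone())
    {
        // Copy the pixels into a new BitmapSource on the UI thread —
        // a full CPU copy per frame, which is where the cost is.
        Dispatcher.Invoke(() =>
        {
            var data = frame.LockBits(
                new Rectangle(0, 0, frame.Width, frame.Height),
                ImageLockMode.ReadOnly, frame.PixelFormat);
            try
            {
                var source = BitmapSource.Create(
                    frame.Width, frame.Height, 96, 96,
                    System.Windows.Media.PixelFormats.Bgr24, null,
                    data.Scan0, data.Stride * frame.Height, data.Stride);
                PreviewImage.Source = source;
            }
            finally
            {
                frame.UnlockBits(data);
            }
        });
    }
}
```

At 4K this means allocating and copying roughly 25 MB per frame on the UI thread, which is why the rendering dominates.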
My experience with WPF and these enormous libraries has made me rethink how I want to program in general; I do not want to make bad software that wastes everyone's time by being slow (see the Handmade Network for more on the "why").
The problem is that camera capture and display was hell in WPF/C#, but I do not seem to be better off anywhere else (on Windows, that is). One option would be to use mostly C++ and DirectShow. This is an okay-ish solution, but it feels outdated in terms of performance, and it is built on Microsoft's COM system, which I would prefer to avoid. There are options to render with hardware using Direct3D, but DirectShow and Direct3D do not play nicely together.
I have researched how other applications achieve this. VLC uses DirectShow, but in my tests it suffers from large latency; I assume this is because VLC was not intended for real-time purposes. OBS Studio uses whatever Qt uses, but I was unable to find out how it does it. OpenCV grabs frames and blits them to the screen, which is not efficient at all, but that suffices for the OpenCV audience. Lastly, there is the built-in Windows camera app. For some reason that app is able to record and play back in real time without a large performance hit. I was not able to figure out how it does this, nor did I find any other solution achieving comparable results.
TL;DR: My questions are:

1. How would I go about efficiently capturing and rendering a camera stream, preferably hardware accelerated?
2. Is it possible to do this on Windows without going through DirectShow?
3. Am I asking too much of commodity devices when I want them to process 4K footage in real time?
I have not found anyone doing this in a way that satisfies my needs, which makes me feel both desperate and guilty at the same time; I would have preferred not to bother Stack Overflow with this problem.
Many thanks in advance for an answer, or for advice on this topic in general.