
I'm doing real-time video processing on iOS at 120 fps and want to first preprocess the image on the GPU (downsampling, color conversion and other operations that are not fast enough on the CPU) and then postprocess the frame on the CPU using OpenCV.

What's the fastest way to share the camera feed between the GPU and the CPU using Metal?

In other words, the pipeline would look like:

CMSampleBufferRef -> MTLTexture or MTLBuffer -> OpenCV Mat

I'm converting CMSampleBufferRef -> MTLTexture the following way:

CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);

// textureBGRA
{
    size_t width = CVPixelBufferGetWidth(pixelBuffer);
    size_t height = CVPixelBufferGetHeight(pixelBuffer);
    MTLPixelFormat pixelFormat = MTLPixelFormatBGRA8Unorm;

    CVMetalTextureRef texture = NULL;
    CVReturn status = CVMetalTextureCacheCreateTextureFromImage(NULL, _textureCache, pixelBuffer, NULL, pixelFormat, width, height, 0, &texture);
    if(status == kCVReturnSuccess) {
        textureBGRA = CVMetalTextureGetTexture(texture);
        CFRelease(texture);
    }
}
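
For completeness, the _textureCache used above has to be created once, up front, from the Metal device. That setup is not shown in the question, so the snippet below is only a sketch with my own variable names:

// One-time setup: create the Core Video Metal texture cache that
// CVMetalTextureCacheCreateTextureFromImage() pulls textures from.
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
CVMetalTextureCacheRef _textureCache = NULL;
CVReturn err = CVMetalTextureCacheCreate(kCFAllocatorDefault,
                                         NULL,           // cache attributes
                                         device,         // Metal device
                                         NULL,           // texture attributes
                                         &_textureCache);
if (err != kCVReturnSuccess) {
    NSLog(@"Failed to create CVMetalTextureCache: %d", err);
}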

After my Metal shader is finished I convert the MTLTexture to an OpenCV Mat:

cv::Mat image;
...
CGSize imageSize = CGSizeMake(drawable.texture.width, drawable.texture.height);
int imageByteCount = int(imageSize.width * imageSize.height * 4);
int mbytesPerRow = 4 * int(imageSize.width);

MTLRegion region = MTLRegionMake2D(0, 0, int(imageSize.width), int(imageSize.height));
[drawable.texture getBytes:image.data bytesPerRow:mbytesPerRow fromRegion:region mipmapLevel:0];
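
For reference, a self-contained version of that copy step might look like the sketch below (my own variable names; it assumes the drawable texture is BGRA8, i.e. 4 bytes per pixel, and that the destination Mat is allocated before getBytes writes into it). This is the copy that costs the ~5 ms mentioned in the observations:

// Copy the rendered BGRA8 texture into a freshly allocated cv::Mat.
id<MTLTexture> tex = drawable.texture;
int width  = (int)tex.width;
int height = (int)tex.height;
size_t bytesPerRow = 4 * (size_t)width;        // BGRA8 = 4 bytes per pixel

cv::Mat image(height, width, CV_8UC4);         // rows = height, cols = width
MTLRegion region = MTLRegionMake2D(0, 0, width, height);
[tex getBytes:image.data
  bytesPerRow:bytesPerRow
   fromRegion:region
  mipmapLevel:0];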

Some observations:

1) Unfortunately MTLTexture.getBytes seems expensive (it copies the data from GPU to CPU?) and takes around 5 ms on my iPhone 5S, which is too much when processing at ~100 fps

2) I noticed some people use MTLBuffer instead of MTLTexture with the following method: metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared) (see: Memory write performance - GPU CPU Shared Memory)

However, the CMSampleBufferRef and the accompanying CVPixelBufferRef are managed by CoreVideo, I guess (a rough Objective-C sketch of that shared-buffer idea follows below).
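
For reference, observation 2 translated to Objective-C is roughly the following sketch (my own names; it assumes the GPU pass writes its output into this buffer and that width/height are the output dimensions):

// Allocate a CPU/GPU-shared buffer once (iOS has unified memory), then wrap
// its contents in a cv::Mat header; no copy is made, the Mat just aliases it.
size_t bytesPerRow = 4 * width;                               // BGRA8 output
id<MTLBuffer> sharedBuffer =
    [device newBufferWithLength:bytesPerRow * height
                        options:MTLResourceStorageModeShared];

cv::Mat image((int)height, (int)width, CV_8UC4,
              [sharedBuffer contents], bytesPerRow);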

pzo
  • The GPU is not supported for all resolutions. I know it's not an answer to your question, I just wanted to give some information about the GPU. – HariKrishnan.P Jun 10 '16 at 13:05
  • Have you tried GPUImage? https://github.com/BradLarson/GPUImage – Sunil Sharma Jun 13 '16 at 13:48
  • I tried GPUImage but the biggest bottleneck is transferring data from GPU to CPU. GPUImage uses OpenGL under the hood and, unlike the Metal API, it cannot use shared memory. – pzo Jun 29 '16 at 16:32
  • I would look for a way to do the OpenCV work on the GPU too. Some parts of OpenCV are available in MetalPerformanceShaders.framework, mostly the image processing stuff. iOS 10 adds convolutional neural networks. If you need other operators, file a feature request bug with Apple. – Ian Ollmann Aug 10 '16 at 14:58
  • I am trying to apply a simple vignette filter to a live camera feed using Metal. The results are pretty slow and laggy; please check this if you can tell me what is missing: https://stackoverflow.com/q/53898780/1364053 – nr5 Dec 23 '18 at 02:53

1 Answer


The fastest way to do this is to use a MTLTexture backed by a MTLBuffer; it is a special kind of MTLTexture that shares memory with a MTLBuffer. However, your CPU-side processing (OpenCV) will be running a frame or two behind. This is unavoidable, since you need to submit the commands to the GPU (encoding) and the GPU needs to render them; if you use waitUntilCompleted to make sure the GPU is finished, that just chews up the CPU and is wasteful.

So the process would be: first you create the MTLBuffer, then you use the MTLBuffer method "newTextureWithDescriptor:offset:bytesPerRow:" to create the special MTLTexture. You need to create the special MTLTexture beforehand (as an instance variable). Then you set up a standard rendering pipeline (faster than using compute shaders) that takes the MTLTexture created from the CMSampleBufferRef and renders into your special MTLTexture; in that pass you can downscale and do any colour conversion as necessary in one go. Then you submit the command buffer to the GPU, and in a subsequent pass you can just call [theMTLbuffer contents] to grab the pointer to the bytes that back your special MTLTexture for use in OpenCV.
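
A rough sketch of that setup, with my own variable names (the target width/height, the 64-byte row alignment and the exact usage flags are assumptions; check them against your device and OS version):

// One-time setup: a shared MTLBuffer plus a linear MTLTexture view of it.
const NSUInteger width  = 640, height = 480;                  // downsampled output size
const NSUInteger bytesPerRow = ((width * 4 + 63) / 64) * 64;  // BGRA8, 64-byte aligned rows

id<MTLBuffer> sharedBuffer =
    [device newBufferWithLength:bytesPerRow * height
                        options:MTLResourceStorageModeShared];

MTLTextureDescriptor *desc =
    [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm
                                                       width:width
                                                      height:height
                                                   mipmapped:NO];
desc.usage       = MTLTextureUsageRenderTarget | MTLTextureUsageShaderRead;
desc.storageMode = MTLStorageModeShared;

// The "special" MTLTexture: it aliases sharedBuffer's memory, so whatever the
// render pass writes into it is visible through [sharedBuffer contents].
id<MTLTexture> outputTexture =
    [sharedBuffer newTextureWithDescriptor:desc offset:0 bytesPerRow:bytesPerRow];

// Per frame: bind outputTexture as the colour attachment of the render pass that
// reads textureBGRA (from the camera), downscales it and converts the colour.
// Once that command buffer has completed:
cv::Mat frame((int)height, (int)width, CV_8UC4,
              [sharedBuffer contents], bytesPerRow);          // no copy, just a view
// ... run the OpenCV postprocessing on `frame` ...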

Any technique that forces a halt in the CPU/GPU interplay will never be efficient, because half the time is spent waiting: the CPU waits for the GPU to finish, and the GPU then has to wait for the next encodings (while the GPU is working you want the CPU to be encoding the next frame and doing any OpenCV work, rather than waiting for the GPU to finish).
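
One common way to keep the two sides overlapped instead of calling waitUntilCompleted is a completion handler plus a semaphore; this is only a sketch (not taken from the answer) and assumes a commandQueue and the shared buffer from the snippet above:

// Allow up to 3 frames in flight: the CPU keeps encoding and doing OpenCV work
// on earlier results while the GPU renders the current frame.
dispatch_semaphore_t inflight = dispatch_semaphore_create(3);

// Per frame:
dispatch_semaphore_wait(inflight, DISPATCH_TIME_FOREVER);
id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
// ... encode the preprocessing render pass into commandBuffer ...
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    // The GPU output for this frame is now in the shared buffer; hand it to
    // the OpenCV stage (e.g. on a serial dispatch queue), then free the slot.
    dispatch_semaphore_signal(inflight);
}];
[commandBuffer commit];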

Also, when people refer to real-time processing they usually mean processing with some real-time (visual) feedback. All modern iOS devices from the 4S up have a 60 Hz screen refresh rate, so any feedback presented faster than that is pointless, but if you need 2 frames (at 120 Hz) to make 1 (at 60 Hz) then you have to have a custom timer or modify CADisplayLink.
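
On the timer side, CADisplayLink can be asked for a higher callback rate on iOS 10 and later via preferredFramesPerSecond (the system clamps it to what the display actually supports, which is 60 Hz on most devices of that era); a sketch, assuming a step: method on self:

// Drive the per-frame work from a display link; request 120 Hz where available.
CADisplayLink *link = [CADisplayLink displayLinkWithTarget:self
                                                  selector:@selector(step:)];
if ([link respondsToSelector:@selector(setPreferredFramesPerSecond:)]) {
    link.preferredFramesPerSecond = 120;   // iOS 10+, clamped to the display's maximum
}
[link addToRunLoop:[NSRunLoop mainRunLoop] forMode:NSRunLoopCommonModes];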

Gary
  • Good tip that GPU rendering (texture shaders) might be limited to 60 fps - makes sense. I actually need the smallest latency possible - I have a custom natural user interface that uses sound as feedback to the user instead of rendering to the display. I don't mind the CPU waiting for the GPU to finish - I just want to move some preprocessing to the GPU (adjust contrast, filter color, resize); these are very fast on the GPU and pretty slow on the CPU (even with NEON) considering my tight computational budget. I can't move (it seems impossible?) other parts to the GPU though, such as contour analysis. Seems the GPU is a dead end for me. – pzo Aug 12 '16 at 15:20
  • I don't think it is a dead end. At the very least it would be relatively easy to set up a pipeline running at 60 Hz, where you encode and do your contour analysis every frame while the GPU concurrently does the necessary preprocessing. Once you have it going and optimised at 60 Hz (the Metal Frame Debugger and Metal System Trace are very useful tools), try jacking it up to 120 Hz. I never tried to use timers or CADisplayLink that fast so I can't help you there, but check out: http://stackoverflow.com/questions/23885638/change-interval-of-cadisplaylink. – Gary Aug 13 '16 at 20:15
  • Also, I'm not very familiar with contour analysis, but using the compute functionality of Metal you may be able to carry it out, and contrast adjustment or resizing are not going to tax the GPU (if the filter is complex, use a LUT). Even with standard vertex and fragment shaders there are often tricks for doing GPU-unfriendly work on the GPU; I implemented a connected component labelling algorithm using Metal and it wasn't too far off the C version for small images. – Gary Aug 13 '16 at 20:28
  • I am trying to apply a simple vignette filter to a live camera feed using metal. The results are pretty slow and laggy, please check this if you can tell me what is missing: https://stackoverflow.com/questions/53898780/how-to-get-high-performance-with-ios-metal-and-cifilter-combination – nr5 Dec 23 '18 at 18:38
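
As a footnote to the comments above about pushing more of the pipeline onto the GPU: MetalPerformanceShaders already covers some of the preprocessing the question mentions (resizing, blurs, histogram-based contrast work). A minimal downsampling sketch, assuming the textureBGRA, outputTexture and commandBuffer objects from the question and the sketches above:

#import <MetalPerformanceShaders/MetalPerformanceShaders.h>

// GPU downsample with MPS (iOS 9+): scale the camera texture into the smaller
// output texture as part of the same command buffer as the rest of the pass.
MPSImageLanczosScale *scaler = [[MPSImageLanczosScale alloc] initWithDevice:device];
[scaler encodeToCommandBuffer:commandBuffer
                sourceTexture:textureBGRA
           destinationTexture:outputTexture];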