
I've been working with OpenCV and Apple's Accelerate framework, and I find the performance of Accelerate slow and Apple's documentation limited. Take, for example:

void equalizeHistogram(const cv::Mat &planar8Image, cv::Mat &equalizedImage)
{
    cv::Size size = planar8Image.size();
    vImage_Buffer planarImageBuffer = {
        .width = static_cast<vImagePixelCount>(size.width),
        .height = static_cast<vImagePixelCount>(size.height),
        .rowBytes = planar8Image.step,
        .data = planar8Image.data
    };

    vImage_Buffer equalizedImageBuffer = {
        .width = static_cast<vImagePixelCount>(size.width),
        .height = static_cast<vImagePixelCount>(size.height),
        .rowBytes = equalizedImage.step,
        .data = equalizedImage.data
    };

    TIME_START(VIMAGE_EQUALIZE_HISTOGRAM);
    vImage_Error error = vImageEqualization_Planar8(&planarImageBuffer, &equalizedImageBuffer, kvImageNoFlags);
    TIME_END(VIMAGE_EQUALIZE_HISTOGRAM);
    if (error != kvImageNoError) {
        NSLog(@"%s, vImage error %zd", __PRETTY_FUNCTION__, error);
    }
}

This call takes roughly 20ms, which in practical terms makes it unusable in my application. Maybe histogram equalization is inherently slow, but I've also tested BGRA->Grayscale conversion and found that OpenCV does it in ~5ms while vImage takes ~20ms.
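
For reference, the two grayscale conversions I compared look roughly like this. This is a sketch rather than my exact test harness: `grayscaleComparison` and `bgraImage` are stand-in names, the coefficients are the standard Rec. 601 luma weights, and the vImage path uses `vImageMatrixMultiply_ARGB8888ToPlanar8`, which may require a newer SDK:

#import <Foundation/Foundation.h>
#import <Accelerate/Accelerate.h>
#include <opencv2/opencv.hpp>

void grayscaleComparison(const cv::Mat &bgraImage)
{
    // OpenCV: one call, ~5ms in my tests.
    cv::Mat gray;
    cv::cvtColor(bgraImage, gray, cv::COLOR_BGRA2GRAY);

    // vImage: a luma dot product per pixel. Coefficients are ordered
    // B, G, R, A to match the BGRA memory layout and scaled by the divisor.
    cv::Mat grayMat(bgraImage.rows, bgraImage.cols, CV_8UC1);
    vImage_Buffer src = {
        .data = bgraImage.data,
        .height = static_cast<vImagePixelCount>(bgraImage.rows),
        .width = static_cast<vImagePixelCount>(bgraImage.cols),
        .rowBytes = bgraImage.step
    };
    vImage_Buffer dst = {
        .data = grayMat.data,
        .height = static_cast<vImagePixelCount>(grayMat.rows),
        .width = static_cast<vImagePixelCount>(grayMat.cols),
        .rowBytes = grayMat.step
    };

    const int32_t divisor = 0x1000;
    const int16_t matrix[4] = {
        static_cast<int16_t>(0.114f * divisor),  // B
        static_cast<int16_t>(0.587f * divisor),  // G
        static_cast<int16_t>(0.299f * divisor),  // R
        0                                        // A, ignored
    };
    vImage_Error error = vImageMatrixMultiply_ARGB8888ToPlanar8(
        &src, &dst, matrix, divisor, NULL, 0, kvImageNoFlags);
    if (error != kvImageNoError) {
        NSLog(@"vImageMatrixMultiply_ARGB8888ToPlanar8 error %zd", error);
    }
}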

While testing other functions, I found a project with a simple slider app demonstrating a blur function (gist), which I cleaned up to test. Roughly ~20ms as well.

Is there some trick to getting these functions to be faster?

Cameron Lowell Palmer
  • While some don't like the idea of asking a question regarding the performance of a framework aimed at performance, I think the question has a lot of value. Apple touts Accelerate as a way to get high-performance code easily, but the documentation is very thin on the use of Accelerate, and SO could improve that by collecting some code examples on this topic. – Cameron Lowell Palmer Feb 26 '15 at 12:31

3 Answers


To get 30 frames per second using the equalizeHistogram function, you must deinterleave the image (convert from ARGBxxxx to PlanarX) and equalize ONLY the R(ed), G(reen) and B(lue) channels; if you equalize A(lpha) as well, the frame rate will drop to 24, at best.

Here is the code that does exactly what you want, as fast as you want:

- (CVPixelBufferRef)copyRenderedPixelBuffer:(CVPixelBufferRef)pixelBuffer {

    CVPixelBufferLockBaseAddress(pixelBuffer, 0);

    unsigned char *base = (unsigned char *)CVPixelBufferGetBaseAddress(pixelBuffer);
    size_t width = CVPixelBufferGetWidth(pixelBuffer);
    size_t height = CVPixelBufferGetHeight(pixelBuffer);
    size_t stride = CVPixelBufferGetBytesPerRow(pixelBuffer);

    vImage_Buffer _img = {
        .data = base,
        .height = height,
        .width = width,
        .rowBytes = stride
    };

    vImage_Error err;
    vImage_Buffer _dstA, _dstR, _dstG, _dstB;

    err = vImageBuffer_Init(&_dstA, height, width, 8 * sizeof(uint8_t), kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageBuffer_Init (alpha) error: %ld", err);

    err = vImageBuffer_Init(&_dstR, height, width, 8 * sizeof(uint8_t), kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageBuffer_Init (red) error: %ld", err);

    err = vImageBuffer_Init(&_dstG, height, width, 8 * sizeof(uint8_t), kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageBuffer_Init (green) error: %ld", err);

    err = vImageBuffer_Init(&_dstB, height, width, 8 * sizeof(uint8_t), kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageBuffer_Init (blue) error: %ld", err);

    err = vImageConvert_ARGB8888toPlanar8(&_img, &_dstA, &_dstR, &_dstG, &_dstB, kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageConvert_ARGB8888toPlanar8 error: %ld", err);

    err = vImageEqualization_Planar8(&_dstR, &_dstR, kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageEqualization_Planar8 (red) error: %ld", err);

    err = vImageEqualization_Planar8(&_dstG, &_dstG, kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageEqualization_Planar8 (green) error: %ld", err);

    err = vImageEqualization_Planar8(&_dstB, &_dstB, kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageEqualization_Planar8 (blue) error: %ld", err);

    err = vImageConvert_Planar8toARGB8888(&_dstA, &_dstR, &_dstG, &_dstB, &_img, kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageConvert_Planar8toARGB8888 error: %ld", err);

    err = vImageContrastStretch_ARGB8888(&_img, &_img, kvImageNoFlags);
    if (err != kvImageNoError)
        NSLog(@"vImageContrastStretch_ARGB8888 error: %ld", err);

    free(_dstA.data);
    free(_dstR.data);
    free(_dstG.data);
    free(_dstB.data);

    CVPixelBufferUnlockBaseAddress(pixelBuffer, 0);

    return (CVPixelBufferRef)CFRetain(pixelBuffer);
}

Notice that I allocate a buffer for the alpha channel even though I do nothing with it; that's simply because converting back and forth between ARGB8888 and Planar8 requires an alpha-channel buffer to be allocated and referenced. You get the same performance and quality enhancements regardless.

Also note that I perform the contrast stretch after converting the Planar8 buffers back into a single ARGB8888 buffer; that's faster than applying the function channel by channel, as I did with histogram equalization, and it produces the same results as doing it individually (the contrast-stretch function does not cause the alpha-channel distortion that histogram equalization does).
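
For context, here's roughly how that method slots into a capture pipeline. This is a sketch: the AVCaptureSession/delegate setup is assumed to exist elsewhere, and the pixel buffer is assumed to be in a 32-bit 4-channel format.

// Sketch: driving copyRenderedPixelBuffer from the video-data delegate.
- (void)captureOutput:(AVCaptureOutput *)captureOutput
didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
       fromConnection:(AVCaptureConnection *)connection
{
    CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
    if (pixelBuffer == NULL) {
        return;
    }

    // Equalizes and contrast-stretches in place, then returns a +1 retain.
    CVPixelBufferRef rendered = [self copyRenderedPixelBuffer:pixelBuffer];

    // ... hand `rendered` off for display or encoding ...

    CVBufferRelease(rendered); // balance the CFRetain inside the method
}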

James Bush
  • Oh, one other thing: if you do it this way (that is, omit the alpha channel from equalization and contrast stretching), the image will look a hundred times better. For some reason, applying these "enhancements" to the alpha channel heavily distorts an ARGB composite. – James Bush May 25 '15 at 23:06
  • That is fascinating information. Didn't even consider that. Did you find this via experiment? – Cameron Lowell Palmer May 26 '15 at 10:16
  • Experimentation is my forté; I always explore every avenue of possibility before putting a product in someone's hands. And, as you just said, the results can indeed be fascinating. – James Bush Jul 21 '15 at 05:55
  • I like this answer. :) – Cameron Lowell Palmer Nov 26 '15 at 08:22
  • So you went from ~20 ms (50 fps) to ~33 ms (30 fps), or am I reading that wrong? Also, I'm surprised that converting to planar and back is faster than just doing it straight on the ARGB image, but Apple also seems to imply this is faster. Is it because you then only have to deal with 3 channels that are prepped for SIMD? It seems like a lot of copying to me, but somehow it's still faster... – SO_fix_the_vote_sorting_bug Oct 20 '21 at 02:01

Don't keep re-allocating vImage_Buffer if you can avoid it.

One thing that is critical to vImage/Accelerate performance is the reuse of vImage_Buffers. I can't say how many times I read hints to this effect in Apple's limited documentation, but I definitely wasn't listening.

In the aforementioned blur code example, I reworked the test app to set up the vImage_Buffer input and output buffers once per image rather than once for each call to boxBlur. That shaved just under 10ms off each call, which made a noticeable difference in response time.

This suggests that Accelerate needs time to warm up before you start seeing performance improvements; the first call to this method took 34ms.

- (UIImage *)boxBlurWithSize:(int)boxSize
{
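    // _inputImageBuffer, _outputImageBuffer, and _inputImageFormat are
    // instance variables set up once per image (see the setup sketch
    // after this method), not once per call.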
    vImage_Error error;
    error = vImageBoxConvolve_ARGB8888(&_inputImageBuffer,
                                       &_outputImageBuffer,
                                       NULL,
                                       0,
                                       0,
                                       boxSize,
                                       boxSize,
                                       NULL,
                                       kvImageEdgeExtend);
    if (error) {
        NSLog(@"vImage error %zd", error);
    }

    CGImageRef modifiedImageRef = vImageCreateCGImageFromBuffer(&_outputImageBuffer,
                                                                &_inputImageFormat,
                                                                NULL,
                                                                NULL,
                                                                kvImageNoFlags,
                                                                &error);

    UIImage *returnImage = [UIImage imageWithCGImage:modifiedImageRef];
    CGImageRelease(modifiedImageRef);

    return returnImage;
}
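
The one-time setup those instance variables rely on looks something like this. This is a sketch, not the exact code from the test app: `setupBuffersForImage:` is a name I made up, and the format values are assumptions for a typical 32-bit ARGB image.

- (void)setupBuffersForImage:(UIImage *)image
{
    _inputImageFormat = (vImage_CGImageFormat){
        .bitsPerComponent = 8,
        .bitsPerPixel = 32,
        .colorSpace = NULL, // NULL falls back to sRGB
        .bitmapInfo = (CGBitmapInfo)kCGImageAlphaFirst,
        .version = 0,
        .decode = NULL,
        .renderingIntent = kCGRenderingIntentDefault
    };

    vImage_Error error;
    // Allocates _inputImageBuffer and converts the CGImage into it.
    error = vImageBuffer_InitWithCGImage(&_inputImageBuffer,
                                         &_inputImageFormat,
                                         NULL,
                                         image.CGImage,
                                         kvImageNoFlags);
    if (error != kvImageNoError) {
        NSLog(@"vImageBuffer_InitWithCGImage error %zd", error);
    }

    // Allocate a matching output buffer once; both buffers are then
    // reused on every call to boxBlurWithSize:.
    error = vImageBuffer_Init(&_outputImageBuffer,
                              _inputImageBuffer.height,
                              _inputImageBuffer.width,
                              32,
                              kvImageNoFlags);
    if (error != kvImageNoError) {
        NSLog(@"vImageBuffer_Init error %zd", error);
    }
}
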
Cameron Lowell Palmer
  • Accelerate runs at whatever speed it can. The issue here is that new memory beyond a certain size is allocated virtually only and then only mapped in later. Every time you touch a new page, the OS kernel takes a fault, zeros the entire thing, then swaps back in. This is what slows down Accelerate. Pre-Allocating and reusing memory allows the vector code to run uninterrupted, which means that it runs flat out. This is a problem for everything, not just Accelerate. However, when you are pushing the speed of light, then cosmic dust like this becomes a problem. – Ian Ollmann Feb 27 '15 at 23:46
  • @IanOllmann absolutely. My goal in documenting these items is to nail down these key concepts. Some of these topics are mentioned in passing in what little documentation exists, but I've seen lots of terrible examples on the net that assume it is fast because it is using Accelerate. Since the mechanics of Accelerate are hidden, by design, when experimenting you might time either side of the call and ignore the malloc/free time, but as we have established, malloc and free aren't really the performance issue. – Cameron Lowell Palmer Feb 28 '15 at 08:28
  • @CameronLowellPalmer To get clarity between `reallocation` and `reuse` of vImage buffers: 1) I assume this example is a good case of `reuse` of a vImage buffer - https://github.com/Itseez/opencv_for_ios_book_samples/blob/0b38fb11b63b2c96723906309c644447ba4fa8cc/CvEffects/CvEffects/Processing_Accelerate.cpp#L18 - 2) and this example would be a case of incorrect / inefficient reallocation of a vImage buffer - https://github.com/Duffycola/opencv-ios-demos/blob/41d575284675553b5671abfc0facd1778e319b5e/shared/CvFilterController.mm#L180 - Am I correct? – kiranpradeep Mar 29 '15 at 16:04
  • @Kiran Generally speaking, those are good examples of dos and don'ts. OpenCV reuses memory if it can, so you're relying on the behavior of OpenCV, whereas the malloc in the code block is definitely a bad sign. – Cameron Lowell Palmer Mar 30 '15 at 05:51
  • There's no allocation of a vImage_Buffer when using OpenCV; no alloc, init, malloc, or any such thing. You simply pass the OpenCV matrix by reference (i.e., prepend an ampersand) to a method, and create a buffer pointer as shown in my example at the bottom of this post. – James Bush May 25 '15 at 23:27

To use vImage with OpenCV, pass a reference to your OpenCV matrix to a method like this one:

long contrastStretch_Accelerate(const Mat& src, Mat& dst) {
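    // vImage_Buffer members, in positional order: data, height, width, rowBytes.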
    vImagePixelCount rows = static_cast<vImagePixelCount>(src.rows);
    vImagePixelCount cols = static_cast<vImagePixelCount>(src.cols);

    vImage_Buffer _src = { src.data, rows, cols, src.step };
    vImage_Buffer _dst = { dst.data, rows, cols, dst.step };

    vImage_Error err;

    err = vImageContrastStretch_ARGB8888( &_src, &_dst, kvImageNoFlags );
    return err;
}

The call to this method, from your OpenCV code block, looks like this:

- (void)processImage:(Mat&)image
{
    contrastStretch_Accelerate(image, image);
}

It's that simple, and since everything is passed by reference, there's no "deep copying" of any kind. It's as fast and efficient as it can possibly be, all questions of context and other related performance considerations aside (I can help you with those, too).

SIDENOTE: Did you know that you have to change the channel permutation when mixing OpenCV with vImage? OpenCV matrices are BGRA in memory, while vImage's ARGB8888 functions expect ARGB, so the map below simply reverses the byte order. Prior to calling any vImage function on an OpenCV matrix, call:

const uint8_t map[4] = { 3, 2, 1, 0 };
err = vImagePermuteChannels_ARGB8888(&_img, &_img, map, kvImageNoFlags);
if (err != kvImageNoError)
    NSLog(@"vImagePermuteChannels_ARGB8888 error: %ld", err);

Perform the same call, map and all, to return the image to the channel order proper for an OpenCV matrix.
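
Putting the sidenote together with the helper above, a full round trip might look like this. This is a sketch: `contrastStretch_BGRA` is a hypothetical name, and the input is assumed to be a 4-channel BGRA matrix.

long contrastStretch_BGRA(const Mat& src, Mat& dst) {
    vImage_Buffer _src = { src.data,
                           static_cast<vImagePixelCount>(src.rows),
                           static_cast<vImagePixelCount>(src.cols),
                           src.step };
    vImage_Buffer _dst = { dst.data,
                           static_cast<vImagePixelCount>(dst.rows),
                           static_cast<vImagePixelCount>(dst.cols),
                           dst.step };

    const uint8_t map[4] = { 3, 2, 1, 0 }; // full byte reversal: BGRA <-> ARGB

    vImage_Error err;
    // BGRA -> ARGB, written into the destination so the source stays intact.
    err = vImagePermuteChannels_ARGB8888(&_src, &_dst, map, kvImageNoFlags);
    if (err != kvImageNoError) return err;

    // Stretch in place while the data is in ARGB order.
    err = vImageContrastStretch_ARGB8888(&_dst, &_dst, kvImageNoFlags);
    if (err != kvImageNoError) return err;

    // ARGB -> BGRA, back to the order OpenCV expects.
    err = vImagePermuteChannels_ARGB8888(&_dst, &_dst, map, kvImageNoFlags);
    return err;
}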

James Bush