
I am trying to create an image from an average of multiple images. The way I do this is to loop through the pixel values of two photos, add them together and divide by two. Simple math. However, while this works, it is extremely slow: about 23 seconds to average two 10 MP photos on a maximally specced MacBook Pro 15" 2016, compared to far less time using Apple's CIFilter API for similar algorithms. The code I'm currently using is this, based on another StackOverflow question here:

static func averageImages(primary: CGImage, secondary: CGImage) -> CGImage? {
        guard (primary.width == secondary.width && primary.height == secondary.height) else {
            return nil
        }

        let colorSpace       = CGColorSpaceCreateDeviceRGB()
        let width            = primary.width
        let height           = primary.height
        let bytesPerPixel    = 4
        let bitsPerComponent = 8
        let bytesPerRow      = bytesPerPixel * width
        let bitmapInfo       = RGBA32.bitmapInfo

        guard let context = CGContext(data: nil, width: width, height: height, bitsPerComponent: bitsPerComponent, bytesPerRow: bytesPerRow, space: colorSpace, bitmapInfo: bitmapInfo) else {
            print("unable to create context")
            return nil
        }

        guard let context2 = CGContext(data: nil, width: width, height: height, bitsPerComponent: bitsPerComponent, bytesPerRow: bytesPerRow, space: colorSpace, bitmapInfo: bitmapInfo) else {
            print("unable to create context 2")
            return nil
        }

        context.draw(primary, in: CGRect(x: 0, y: 0, width: width, height: height))

        context2.draw(secondary, in: CGRect(x: 0, y: 0, width: width, height: height))


        guard let buffer = context.data else {
            print("Unable to get context data")
            return nil
        }

        guard let buffer2 = context2.data else {
            print("Unable to get context 2 data")
            return nil
        }

        let pixelBuffer = buffer.bindMemory(to: RGBA32.self, capacity: width * height)
        let pixelBuffer2 = buffer2.bindMemory(to: RGBA32.self, capacity: width * height)

        for row in 0 ..< Int(height) {
            if row % 10 == 0 {
                print("Row: \(row)")
            }

            for column in 0 ..< Int(width) {
                let offset = row * width + column

                let picture1 = pixelBuffer[offset]
                let picture2 = pixelBuffer2[offset]

                let minR = min(255,(UInt32(picture1.redComponent)+UInt32(picture2.redComponent))/2)
                let minG = min(255,(UInt32(picture1.greenComponent)+UInt32(picture2.greenComponent))/2)
                let minB = min(255,(UInt32(picture1.blueComponent)+UInt32(picture2.blueComponent))/2)
                let minA = min(255,(UInt32(picture1.alphaComponent)+UInt32(picture2.alphaComponent))/2)


                pixelBuffer[offset] = RGBA32(red: UInt8(minR), green: UInt8(minG), blue: UInt8(minB), alpha: UInt8(minA))
            }
        }

        let outputImage = context.makeImage()


        return outputImage
    }

    struct RGBA32: Equatable {
        //private var color: UInt32
        var color: UInt32

        var redComponent: UInt8 {
            return UInt8((color >> 24) & 255)
        }

        var greenComponent: UInt8 {
            return UInt8((color >> 16) & 255)
        }

        var blueComponent: UInt8 {
            return UInt8((color >> 8) & 255)
        }

        var alphaComponent: UInt8 {
            return UInt8((color >> 0) & 255)
        }

        init(red: UInt8, green: UInt8, blue: UInt8, alpha: UInt8) {
            let red   = UInt32(red)
            let green = UInt32(green)
            let blue  = UInt32(blue)
            let alpha = UInt32(alpha)
            color = (red << 24) | (green << 16) | (blue << 8) | (alpha << 0)
        }

        init(color: UInt32) {
            self.color = color
        }

        static let red     = RGBA32(red: 255, green: 0,   blue: 0,   alpha: 255)
        static let green   = RGBA32(red: 0,   green: 255, blue: 0,   alpha: 255)
        static let blue    = RGBA32(red: 0,   green: 0,   blue: 255, alpha: 255)
        static let white   = RGBA32(red: 255, green: 255, blue: 255, alpha: 255)
        static let black   = RGBA32(red: 0,   green: 0,   blue: 0,   alpha: 255)
        static let magenta = RGBA32(red: 255, green: 0,   blue: 255, alpha: 255)
        static let yellow  = RGBA32(red: 255, green: 255, blue: 0,   alpha: 255)
        static let cyan    = RGBA32(red: 0,   green: 255, blue: 255, alpha: 255)

        static let bitmapInfo = CGImageAlphaInfo.premultipliedLast.rawValue | CGBitmapInfo.byteOrder32Little.rawValue

        static func ==(lhs: RGBA32, rhs: RGBA32) -> Bool {
            return lhs.color == rhs.color
        }
    }

I'm not very experienced when it comes to working with raw pixel values, and there is probably room for a lot of optimisation. The declaration of RGBA32 may not be required, but again I'm not sure how I'd go about simplifying the code. I've tried simply replacing that struct with a UInt32; however, as I divide by 2, the separation between the four channels gets messed up and I end up with the wrong result (on the positive side, this brings the computing time down to about 6 seconds).
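To show what I mean by the channels getting messed up (a made-up pixel value, just for illustration): halving the whole packed UInt32 lets the low bit of each byte bleed into the channel below it, whereas shifting and masking each channel keeps the bytes independent:

let pixel: UInt32 = 0x0302_01FF                     // R = 0x03, G = 0x02, B = 0x01, A = 0xFF

// Halving the packed word in one go lets red's low bit bleed into green.
let halvedWholeWord = pixel / 2                     // 0x0181_00FF – green is now 0x81, not 0x01

// Shifting, then masking off the bled-in bits, keeps every channel independent.
let halvedPerChannel = (pixel >> 1) & 0x7F7F_7F7F   // 0x0101_007F – each channel correctly halved

print(String(halvedWholeWord, radix: 16), String(halvedPerChannel, radix: 16))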

I've tried dropping the alpha channel (just hardcoding it to 255) and also dropping the safety checks that ensure no values exceed 255. This reduced the computing time to 19 seconds. However, that is far from the 6 seconds I was hoping to get close to, and it would also be nice to average the alpha channel too.

Note: I am aware of CIFilters; however, darkening an image first and then using the CIAdditionCompositing filter does not work, as Apple's API actually uses a more complex algorithm than straightforward addition. For more details on this, see here for my previous code on the subject and a similar question here, with testing showing that Apple's API is not a straightforward addition of pixel values.

**Edit:** Thanks to all the feedback I have now been able to make vast improvements. By far the biggest difference was switching from a debug to a release build, which dropped the time dramatically. Then I was able to write faster code for modifying the RGBA values, eliminating the need for a separate struct. That changed the time from 23 seconds to about 10 (on top of the debug-to-release improvement). The code now looks like this, also rewritten a bit for readability:

static func averageImages(primary: CGImage, secondary: CGImage) -> CGImage? {
    guard (primary.width == secondary.width && primary.height == secondary.height) else {
        return nil
    }

    let colorSpace       = CGColorSpaceCreateDeviceRGB()
    let width            = primary.width
    let height           = primary.height
    let bytesPerPixel    = 4
    let bitsPerComponent = 8
    let bytesPerRow      = bytesPerPixel * width
    let bitmapInfo       = CGImageAlphaInfo.premultipliedLast.rawValue | CGBitmapInfo.byteOrder32Little.rawValue

    guard let primaryContext = CGContext(data: nil, width: width, height: height, bitsPerComponent: bitsPerComponent, bytesPerRow: bytesPerRow, space: colorSpace, bitmapInfo: bitmapInfo),
        let secondaryContext = CGContext(data: nil, width: width, height: height, bitsPerComponent: bitsPerComponent, bytesPerRow: bytesPerRow, space: colorSpace, bitmapInfo: bitmapInfo) else {
            print("unable to create context")
            return nil
    }

    primaryContext.draw(primary, in: CGRect(x: 0, y: 0, width: width, height: height))
    secondaryContext.draw(secondary, in: CGRect(x: 0, y: 0, width: width, height: height))

    guard let primaryBuffer = primaryContext.data, let secondaryBuffer = secondaryContext.data else {
        print("Unable to get context data")
        return nil
    }

    let primaryPixelBuffer = primaryBuffer.bindMemory(to: UInt32.self, capacity: width * height)
    let secondaryPixelBuffer = secondaryBuffer.bindMemory(to: UInt32.self, capacity: width * height)

    for row in 0 ..< Int(height) {
        if row % 10 == 0 {
            print("Row: \(row)")
        }

        for column in 0 ..< Int(width) {
            let offset = row * width + column

            let primaryPixel = primaryPixelBuffer[offset]
            let secondaryPixel = secondaryPixelBuffer[offset]

            let red = (((primaryPixel >> 24) & 255)/2 + ((secondaryPixel >> 24) & 255)/2) << 24
            let green = (((primaryPixel >> 16) & 255)/2 + ((secondaryPixel >> 16) & 255)/2) << 16
            let blue = (((primaryPixel >> 8) & 255)/2 + ((secondaryPixel >> 8) & 255)/2) << 8
            let alpha = ((primaryPixel & 255)/2 + (secondaryPixel & 255)/2)

            primaryPixelBuffer[offset] = red | green | blue | alpha
        }
    }

    print("Done looping")
    let outputImage = primaryContext.makeImage()

    return outputImage
}

As for multithreading, I am going to run this function several times, so I will implement the multithreading over the calls to the function rather than within the function itself. I expect an even greater performance boost from this, but it has to be balanced against the increased memory use of having more images in memory at the same time.
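A rough sketch of what I have in mind, parallelising across calls with DispatchQueue.concurrentPerform (the pairs parameter and the wrapper function name are just placeholders; averageImages is the function above, assumed to be accessible in this scope):

import CoreGraphics
import Foundation

// Rough sketch only: runs the averaging function above over several image pairs in parallel.
func averageAllPairs(_ pairs: [(primary: CGImage, secondary: CGImage)]) -> [CGImage?] {
    var results = [CGImage?](repeating: nil, count: pairs.count)
    let lock = NSLock()   // serialises writes into the shared results array

    DispatchQueue.concurrentPerform(iterations: pairs.count) { index in
        let averaged = averageImages(primary: pairs[index].primary,
                                     secondary: pairs[index].secondary)
        lock.lock()
        results[index] = averaged
        lock.unlock()
    }
    return results
}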

Thanks to everyone who contributed to this. Since all feedback has been through comments, I can't mark any of them as the right answer. I also don't want to post my updated code as an answer, as I wasn't the one who really came up with it. Any suggestions on how to proceed?

  • Probably not the answer you want to hear, but you need to build your own Core Image kernel to shift this onto the GPU. This book will help https://gumroad.com/l/CoreImageForSwift – Warren Burton Mar 13 '20 at 22:16
  • You will get considerable improvement in this `for` loop by parallelizing with `concurrentPerform` (and striding) so you get all the cores into the process. Possibly even better is [vImage](https://developer.apple.com/documentation/accelerate/vimage), such as [Alpha Compositing](https://developer.apple.com/documentation/accelerate/vimage/vimage_operations/alpha_compositing), or write your own vector code with [vDSP](https://developer.apple.com/documentation/accelerate/vdsp). – Rob Mar 14 '20 at 01:08
  • All of this having been said, when I benchmarked this on a 5,000 × 5,000 pixel image, my similarly equipped machine is solving this in a fraction of a second. Are you sure you're using an optimized/release build? Parallelizing it improved it even more, but it's not material when it's already this fast. – Rob Mar 14 '20 at 02:23
  • CIImage filters use the Accelerate framework and the GPU, and you don't. Also: if you want to know why something is slow, use Instruments (time profiler). – matt Mar 14 '20 at 04:35
  • Thanks for all the feedback, see my updated post for improvements. Since all feedback has been through comments I can't mark any of them as the right answer. I also don't want to post my updated code as an answer as I wasn't the one who really made the answer. Any suggestions on how to proceed to give credit where credit is due? – Jorn Mar 14 '20 at 10:09
  • Answering your own question is legal and encouraged. Do _not_ edit the original question to include the answer. – matt Mar 14 '20 at 13:54
  • Thanks @Jorn! The release mode made a big improvement in my code! Another (mainly readability) improvement is to replace the nested row/column loop and the calculation of offset with a simple loop, i.e.: `for offset in 0 ..< height*width { ... }` – ragnarius Aug 06 '20 at 18:06

2 Answers


There are a few options:

  1. Parallelize the routine:

    You can improve performance with concurrentPerform, moving the processing onto multiple cores. In its simplest form, you just replace your outer for loop with concurrentPerform:

    extension CGImage {
        func average(with secondImage: CGImage) -> CGImage? {
            guard
                width == secondImage.width,
                height == secondImage.height
            else {
                return nil
            }
    
            let colorSpace       = CGColorSpaceCreateDeviceRGB()
            let bytesPerPixel    = 4
            let bitsPerComponent = 8
            let bytesPerRow      = bytesPerPixel * width
            let bitmapInfo       = RGBA32.bitmapInfo
    
            guard
                let context1 = CGContext(data: nil, width: width, height: height, bitsPerComponent: bitsPerComponent, bytesPerRow: bytesPerRow, space: colorSpace, bitmapInfo: bitmapInfo),
                let context2 = CGContext(data: nil, width: width, height: height, bitsPerComponent: bitsPerComponent, bytesPerRow: bytesPerRow, space: colorSpace, bitmapInfo: bitmapInfo),
                let buffer1 = context1.data,
                let buffer2 = context2.data
            else {
                return nil
            }
    
            context1.draw(self,        in: CGRect(x: 0, y: 0, width: width, height: height))
            context2.draw(secondImage, in: CGRect(x: 0, y: 0, width: width, height: height))
    
            let imageBuffer1 = buffer1.bindMemory(to: UInt8.self, capacity: width * height * 4)
            let imageBuffer2 = buffer2.bindMemory(to: UInt8.self, capacity: width * height * 4)
    
            DispatchQueue.concurrentPerform(iterations: height) { row in   // i.e. a parallelized version of `for row in 0 ..< height {`
            var offset = row * bytesPerRow
            for _ in 0 ..< bytesPerRow {
                let byte1 = imageBuffer1[offset]
                let byte2 = imageBuffer2[offset]

                imageBuffer1[offset] = byte1 / 2 + byte2 / 2
                offset += 1   // advance to the next byte only after the current one has been processed
            }
            }
    
            return context1.makeImage()
        }
    }
    

    Note, a few other observations:

    • Because you're doing the same calculation on every byte, you might simplify this further, getting rid of casts, shifts, masks, etc. I also moved repetitive calculations out of the inner loop.

    • As a result, I'm using the UInt8 type and iterating through bytesPerRow.

    • FWIW, I’ve defined this as a CGImage extension, which is invoked as:

      let combinedImage = image1.average(with: image2)
      
    • Right now, we're striding through the pixels by row in the pixel array. You can play around with changing this to process multiple pixels per iteration of concurrentPerform, though I didn't see a material change when I did that. A rough sketch of that variant follows below.
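    One way to try that, as a rough sketch rather than the answer's original code (it reuses imageBuffer1, imageBuffer2, bytesPerRow and height from the block above; the chunk size of 16 rows is just an assumed starting point to tune):

        let rowsPerChunk = 16
        let chunks = (height + rowsPerChunk - 1) / rowsPerChunk      // ceiling division

        DispatchQueue.concurrentPerform(iterations: chunks) { chunk in
            let firstRow = chunk * rowsPerChunk
            let lastRow = min(firstRow + rowsPerChunk, height)
            for row in firstRow ..< lastRow {
                let start = row * bytesPerRow
                for offset in start ..< start + bytesPerRow {
                    imageBuffer1[offset] = imageBuffer1[offset] / 2 + imageBuffer2[offset] / 2
                }
            }
        }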
       

    I found that concurrentPerform was many times faster than the non-parallelized for loop. Unfortunately, the nested for loops are only a small part of the overall processing time of the entire function (e.g. once you include the overhead of building these two pixel buffers, the overall performance is only 40% faster than the non-optimized rendition). On a well-specced MBP 2018, it processes 10,000 × 10,000 px images in under half a second.

  2. The other alternative is the Accelerate vImage library.

    This library offers a wide variety of image-processing routines and is a good library to familiarize yourself with if you're going to be processing large images. I don't know if its alpha-compositing algorithm is mathematically identical to an "average the byte values" algorithm, but it might be sufficient for your purposes. It has the virtue of replacing your nested for loops with a single API call. It also opens the door to a far wider variety of image compositing and manipulation routines:

    extension CGImage {
        func averageVimage(with secondImage: CGImage) -> CGImage? {
            let bitmapInfo: CGBitmapInfo = [.byteOrder32Little, CGBitmapInfo(rawValue: CGImageAlphaInfo.premultipliedLast.rawValue)]
            let colorSpace = CGColorSpaceCreateDeviceRGB()
    
            guard
                width == secondImage.width,
                height == secondImage.height,
                let format = vImage_CGImageFormat(bitsPerComponent: 8, bitsPerPixel: 32, colorSpace: colorSpace, bitmapInfo: bitmapInfo)
            else {
                return nil
            }
    
            guard var sourceBuffer = try? vImage_Buffer(cgImage: self, format: format) else { return nil }
            defer { sourceBuffer.free() }
    
            guard var sourceBuffer2 = try? vImage_Buffer(cgImage: secondImage, format: format) else { return nil }
            defer { sourceBuffer2.free() }
    
            guard var destinationBuffer = try? vImage_Buffer(width: width, height: height, bitsPerPixel: 32) else { return nil }
            defer { destinationBuffer.free() }
    
            guard vImagePremultipliedConstAlphaBlend_ARGB8888(&sourceBuffer, Pixel_8(127), &sourceBuffer2, &destinationBuffer, vImage_Flags(kvImageNoFlags)) == kvImageNoError else {
                return nil
            }
    
            return try? destinationBuffer.createCGImage(format: format)
        }
    }
    

    Anyway, I found the performance here to be similar to the concurrentPerform algorithm.

  3. For giggles and grins, I also tried rendering the images with CGBitmapInfo.floatComponents and used the BLAS catlas_saxpby for a one-line call to average the two vectors. It worked well but, unsurprisingly, was slower than the above integer-based routines. A rough sketch of the idea is below.
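
    As a rough sketch of that idea (not the answer's exact code): catlas_saxpby computes Y = alpha*X + beta*Y in place, so with alpha = beta = 0.5 it averages two Float buffers. The two small arrays below are just placeholders standing in for the images' float components:

        import Accelerate

        let floatPixels1: [Float] = [0.2, 0.4, 0.6, 1.0]   // placeholder for image 1's float components
        var floatPixels2: [Float] = [0.8, 0.0, 0.2, 1.0]   // placeholder for image 2's float components

        // After this call, floatPixels2 holds 0.5 * image1 + 0.5 * image2, i.e. the average.
        catlas_saxpby(Int32(floatPixels1.count), 0.5, floatPixels1, 1, 0.5, &floatPixels2, 1)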

Rob
  • The vImage alpha blending functions will look to the alpha channel to decide how much of each pixel from the top layer and how much from the bottom layer to use to make the final pixel. You are looking for more of a constant blending operation with some sort of external blending factor (0.5) to combine the layers. – Ian Ollmann May 29 '20 at 00:42

This is a little bit hacky but will work and is the algorithm you are looking for. Use vImageMatrixMultiply_Planar< channel fmt >() to scale each layer and add them together. The matrix coefficient for the layer is the weight for that layer, presumably 1/N for N layers, if you want them equally weighted.

Since we are using a planar function on possibly interleaved data, you'll need to multiply the width of the src and dest buffers by the number of channels in the image.
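
A rough sketch of that approach, assuming interleaved 8-bit RGBA buffers (this is not the answer's code, and the Swift bridging details are assumptions): each buffer is treated as a single plane whose width has been multiplied by the 4 channels, and the two layers are weighted 1/2 each via the matrix coefficients and integer divisor:

    import Accelerate

    // Rough sketch: averages two interleaved RGBA8888 vImage_Buffers of identical size into
    // `destination` by treating each one as a single plane that is 4x wider, then applying
    // vImageMatrixMultiply_Planar8 with a weight of 1/2 per layer.
    func averageWithMatrixMultiply(_ first: vImage_Buffer,
                                   _ second: vImage_Buffer,
                                   into destination: vImage_Buffer) -> Bool {
        let channelCount: vImagePixelCount = 4

        // Multiply the widths by the channel count, per the note above.
        var planarFirst = first
        planarFirst.width *= channelCount
        var planarSecond = second
        planarSecond.width *= channelCount
        var planarDestination = destination
        planarDestination.width *= channelCount

        // One coefficient per source layer; each is divided by `divisor`, i.e. a weight of 1/2.
        let matrix: [Int16] = [1, 1]
        let divisor: Int32 = 2

        var error: vImage_Error = kvImageNoError
        withUnsafePointer(to: &planarFirst) { firstPointer in
            withUnsafePointer(to: &planarSecond) { secondPointer in
                withUnsafePointer(to: &planarDestination) { destinationPointer in
                    var sources: [UnsafePointer<vImage_Buffer>?] = [firstPointer, secondPointer]
                    var destinations: [UnsafePointer<vImage_Buffer>?] = [destinationPointer]
                    error = vImageMatrixMultiply_Planar8(&sources, &destinations, 2, 1,
                                                         matrix, divisor, nil, 0,
                                                         vImage_Flags(kvImageNoFlags))
                }
            }
        }
        return error == kvImageNoError
    }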

Ian Ollmann