A couple of observations:
Make sure you’re using an optimized/release build, not an unoptimized debug build. On my device, a debug build takes about 4 seconds to process a 12 megapixel image, whereas the release build takes 0.3 seconds.
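If you want to verify this on your own device, a simple way to time the routine is something like the following sketch, where `processImage()` is a hypothetical stand-in for whatever routine you are benchmarking:

import Foundation

// Hypothetical stand-in for your image-processing routine.
func processImage() {
    // ... your pixel-scanning code ...
}

// Measure wall-clock time; run this against a release build.
let start = CFAbsoluteTimeGetCurrent()
processImage()
let elapsed = CFAbsoluteTimeGetCurrent() - start
print("Elapsed: \(elapsed) seconds")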
When you have a `for` loop, you can parallelize it to leverage all of the cores on the CPU. By doing this with the striding algorithm, the `for` loop was almost 4 times faster.
That sounds great, but unfortunately, the problem is that, of the 0.3 seconds to process the image, most of that was the preparation of the image buffer. (Now, in your example, you’re not re-rendering it into a predefined pixel buffer, which is a little dangerous IMHO, so maybe you don’t have this overhead. But, regardless, a difference of 10+ msec is generally not observable unless you’re processing hundreds of images.) The actual `for` loop accounted for only 16 msec of the elapsed time. So, while cutting that down to 4 msec is almost 4 times faster, from a user’s perspective it is immaterial.
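For context, by “preparation of the image buffer” I mean re-rendering the image into a predefined pixel buffer with a known byte layout before scanning it. A minimal sketch of that step (assuming an RGBA8888 layout and the device RGB color space; adjust to your needs) might look like this:

import CoreGraphics

// Render a CGImage into a freshly allocated RGBA8888 buffer so the
// scanning loop can make assumptions about the byte layout.
func rgbaBuffer(for image: CGImage) -> [UInt8]? {
    let width = image.width
    let height = image.height
    let bytesPerRow = width * 4
    var pixels = [UInt8](repeating: 0, count: bytesPerRow * height)

    let rendered: Bool = pixels.withUnsafeMutableBytes { buffer in
        guard let context = CGContext(
            data: buffer.baseAddress,
            width: width,
            height: height,
            bitsPerComponent: 8,
            bytesPerRow: bytesPerRow,
            space: CGColorSpaceCreateDeviceRGB(),
            bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
        ) else { return false }

        context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
        return true
    }

    return rendered ? pixels : nil
}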
Anyway, feel free to see the parallel algorithm with striding below, in my original answer.
One very simple approach to improving `for` loop performance is to use `concurrentPerform` to parallelize the routine.
For example, here is a non-parallelized routine:
var total = 0
for x in 0..<maxX {
    for y in 0..<maxY {
        if ... {
            total += 1
        }
    }
}
print(total)
You can parallelize it by, first, flipping the `x` and `y` loops, because we want the outer loop to be a row in the image. The idea is to ensure not only that each thread is working with contiguous memory blocks, but also that we minimize the amount of overlap, to avoid “cache sloshing”. Thus, consider:
for y in 0..<maxY {
    for x in 0..<maxX {
        if ... {
            total += 1
        }
    }
}
We’re not actually going to use the above, but we’ll use it as a model in the next step, which is to replace the outer `for` loop (now the `y` coordinate) with `concurrentPerform`:
var total = 0
let syncQueue = DispatchQueue(label: "...")

DispatchQueue.concurrentPerform(iterations: maxY) { y in
    var subTotal = 0
    for x in 0..<maxX {
        if ... {
            subTotal += 1
        }
    }

    syncQueue.sync {
        total += subTotal
    }
}
print(total)
So, the idea is:
- replace the outer `for` loop with `concurrentPerform`;
- rather than trying to update `total` for every iteration of `x`, have a `subTotal` variable for each thread and only update `total` at the end (minimizing contention from multiple threads for this shared resource); and
- use some synchronization mechanism (I’ve used a serial queue here, but any synchronization mechanism will work) to update `total`, to ensure thread-safety.
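To make that concrete, here is a complete, runnable sketch of the pattern, where `isInteresting(x:y:)` is a hypothetical stand-in for whatever test the elided `if` performs, and the dimensions are arbitrary:

import Foundation

let maxX = 4_000
let maxY = 3_000

// Hypothetical stand-in for whatever per-pixel condition you are testing.
func isInteresting(x: Int, y: Int) -> Bool {
    (x + y).isMultiple(of: 2)
}

var total = 0
let syncQueue = DispatchQueue(label: "com.example.sync")   // label is arbitrary

DispatchQueue.concurrentPerform(iterations: maxY) { y in
    var subTotal = 0
    for x in 0..<maxX where isInteresting(x: x, y: y) {
        subTotal += 1
    }

    // Synchronize updates to the shared total.
    syncQueue.sync { total += subTotal }
}

print(total)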
I was trying to keep the example as simple as possible, but there are other optimizations one can make:

Different synchronization techniques offer different performance. E.g., you can use `NSLock` (which conventional wisdom says is slower, but my recent benchmarks suggest that the performance can be better than GCD in many scenarios) by defining a `sync` method in a protocol extension (to give a nice, safe way of using locks), like so:
// Adapted from Apple’s `withCriticalSection` code sample
extension NSLocking {
    func sync<T>(_ closure: () throws -> T) rethrows -> T {
        lock()
        defer { unlock() }
        return try closure()
    }
}
Then you can do something like:
var total = 0
let lock = NSLock()

DispatchQueue.concurrentPerform(iterations: maxY) { y in
    var subTotal = 0
    for x in 0..<maxX {
        if ... {
            subTotal += 1
        }
    }

    lock.sync {
        total += subTotal
    }
}
print(total)
Feel free to try whatever synchronization mechanisms you want. But the idea is that if you’re going to access `total` from multiple threads, make sure to do so in a thread-safe manner. Temporarily turn on the “Thread Sanitizer” if you want to check your thread-safety.
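For example, another mechanism you might try (just a sketch, reusing the hypothetical `isInteresting(x:y:)` from earlier, and not necessarily faster than the lock) is a concurrent queue, dispatching the writes with a barrier and reading the result synchronously:

var total = 0
let queue = DispatchQueue(label: "com.example.total", attributes: .concurrent)   // label is arbitrary

DispatchQueue.concurrentPerform(iterations: maxY) { y in
    var subTotal = 0
    for x in 0..<maxX where isInteresting(x: x, y: y) {
        subTotal += 1
    }

    // The barrier gives this block exclusive access to `total` on the queue.
    queue.async(flags: .barrier) {
        total += subTotal
    }
}

// A sync read is enqueued after the barrier blocks, so it sees the final value.
queue.sync {
    print(total)
}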
If there’s not enough work on each thread (e.g. `maxX` isn’t very large or, as in this case, the algorithm is so fast), the overhead of the parallelized routine can start to offset the benefits of getting multiple cores involved in the calculation. So you can “stride” through multiple rows of `y` in each iteration. For example:
var total = 0
let lock = NSLock()
let stride = maxY / 20
let iterations = Int((Double(maxY) / Double(stride)).rounded(.up))

DispatchQueue.concurrentPerform(iterations: iterations) { i in
    var subTotal = 0
    let range = i * stride ..< min(maxY, (i + 1) * stride)
    for y in range {
        for x in 0 ..< maxX {
            if ... {
                subTotal += 1
            }
        }
    }
    lock.sync { total += subTotal }
}
print(total)
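Two small notes on the above: the divisor (20 here) is just a tuning knob, i.e., enough iterations to keep all the cores busy, but few enough that each iteration has a meaningful amount of work to amortize the dispatching overhead. And if `maxY` could ever be smaller than the divisor, the stride would work out to zero, so you might clamp it:

let stride = max(1, maxY / 20)   // avoid a zero stride for very small images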