A couple of observations:
Make sure you’re using an optimized/release build, not an unoptimized debug build. On my device, a debug build takes about 4 seconds to process a 12 megapixel image, whereas the release build takes 0.3 seconds.
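If you want to verify this on your own device, a simple way to time the routine is something like the following sketch, where `processImage()` is a hypothetical stand-in for whatever routine you are benchmarking:

import Foundation

// Hypothetical stand-in for your image-processing routine.
func processImage() {
    // ... your pixel-scanning code ...
}

// Measure wall-clock time; run this against a release build.
let start = CFAbsoluteTimeGetCurrent()
processImage()
let elapsed = CFAbsoluteTimeGetCurrent() - start
print("Elapsed: \(elapsed) seconds")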
When you have a `for` loop, you can parallelize it to leverage all of the cores on the CPU. By doing this with the striding algorithm, the `for` loop was almost 4 times faster.
That sounds great, but unfortunately, the problem is that, of the 0.3 seconds to process the image, most of that was the preparation of the image buffer. (Now, in your example, you’re not re-rendering it into a predefined pixel buffer, which is a little dangerous IMHO, so maybe you don’t have this overhead. But, regardless, a difference of 10+ msec is generally not observable unless you’re processing hundreds of images.) The actual `for` loop accounted for only 16 msec of the elapsed time. So, while cutting that down to 4 msec is almost 4 times faster, from a user’s perspective it is immaterial.
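For context, by “preparation of the image buffer” I mean re-rendering the image into a predefined pixel buffer with a known byte layout before scanning it. A minimal sketch of that step (assuming an RGBA8888 layout and the device RGB color space; adjust to your needs) might look like this:

import CoreGraphics

// Render a CGImage into a freshly allocated RGBA8888 buffer so the
// scanning loop can make assumptions about the byte layout.
func rgbaBuffer(for image: CGImage) -> [UInt8]? {
    let width = image.width
    let height = image.height
    let bytesPerRow = width * 4
    var pixels = [UInt8](repeating: 0, count: bytesPerRow * height)

    let rendered: Bool = pixels.withUnsafeMutableBytes { buffer in
        guard let context = CGContext(
            data: buffer.baseAddress,
            width: width,
            height: height,
            bitsPerComponent: 8,
            bytesPerRow: bytesPerRow,
            space: CGColorSpaceCreateDeviceRGB(),
            bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
        ) else { return false }

        context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
        return true
    }

    return rendered ? pixels : nil
}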
Anyway, feel free to see the parallel algorithm with striding below, in my original answer.
One very simple approach to improving `for` loop performance is to use `concurrentPerform` to parallelize the routine.
For example, here is a non-parallelized routine:
var total = 0
for x in 0..<maxX {
    for y in 0..<maxY {
        if ... {
            total += 1
        }
    }
}
print(total)
You can parallelize it by, first, flipping the `x` and `y` loops, because we want the outer loop to be a row in the image. The idea is to ensure not only that each thread is working with contiguous memory blocks, but also that we minimize the amount of overlap, to avoid “cache sloshing”. Thus, consider:
for y in 0..<maxY {
    for x in 0..<maxX {
        if ... {
            total += 1
        }
    }
}
We’re not actually going to use the above, but we’ll use it as a model in the next step, which is to replace the outer `for` loop (now the `y` coordinate) with `concurrentPerform`:
var total = 0
let syncQueue = DispatchQueue(label: "...")

DispatchQueue.concurrentPerform(iterations: maxY) { y in
    var subTotal = 0
    for x in 0..<maxX {
        if ... {
            subTotal += 1
        }
    }

    syncQueue.sync {
        total += subTotal
    }
}
print(total)
So, the idea is:
- replace the outer `for` loop with `concurrentPerform`;
- rather than trying to update `total` for every iteration of `x`, have a `subTotal` variable for each thread and only update `total` at the end (minimizing contention from multiple threads for this shared resource); and
- use some synchronization mechanism (I’ve used a serial queue here, but any synchronization mechanism will work) to update `total`, to ensure thread-safety.
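To make that concrete, here is a complete, runnable sketch of the pattern, where `isInteresting(x:y:)` is a hypothetical stand-in for whatever test the elided `if` performs, and the dimensions are arbitrary:

import Foundation

let maxX = 4_000
let maxY = 3_000

// Hypothetical stand-in for whatever per-pixel condition you are testing.
func isInteresting(x: Int, y: Int) -> Bool {
    (x + y).isMultiple(of: 2)
}

var total = 0
let syncQueue = DispatchQueue(label: "com.example.sync")   // label is arbitrary

DispatchQueue.concurrentPerform(iterations: maxY) { y in
    var subTotal = 0
    for x in 0..<maxX where isInteresting(x: x, y: y) {
        subTotal += 1
    }

    // Synchronize updates to the shared total.
    syncQueue.sync { total += subTotal }
}

print(total)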
I was trying to keep the example as simple as possible, but there are other optimizations one can make:

Different synchronization techniques offer different performance. E.g., you can use `NSLock` (which conventional wisdom says is slower, but my recent benchmarks suggest that the performance can be better than GCD in many scenarios) by defining a `sync` method in a protocol extension (to give a nice, safe way of using locks), like so:
// Adapted from Apple’s `withCriticalSection` code sample
extension NSLocking {
    func sync<T>(_ closure: () throws -> T) rethrows -> T {
        lock()
        defer { unlock() }
        return try closure()
    }
}
Then you can do something like:
var total = 0
let lock = NSLock()

DispatchQueue.concurrentPerform(iterations: maxY) { y in
    var subTotal = 0
    for x in 0..<maxX {
        if ... {
            subTotal += 1
        }
    }

    lock.sync {
        total += subTotal
    }
}
print(total)
Feel free to try whatever synchronization mechanisms you want. But the idea is that if you’re going to access `total` from multiple threads, make sure to do so in a thread-safe manner. Temporarily turn on the “Thread Sanitizer” if you want to check your thread-safety.
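For example, another mechanism you might try (just a sketch, reusing the hypothetical `isInteresting(x:y:)` from earlier, and not necessarily faster than the lock) is a concurrent queue, dispatching the writes with a barrier and reading the result synchronously:

var total = 0
let queue = DispatchQueue(label: "com.example.total", attributes: .concurrent)   // label is arbitrary

DispatchQueue.concurrentPerform(iterations: maxY) { y in
    var subTotal = 0
    for x in 0..<maxX where isInteresting(x: x, y: y) {
        subTotal += 1
    }

    // The barrier gives this block exclusive access to `total` on the queue.
    queue.async(flags: .barrier) {
        total += subTotal
    }
}

// A sync read is enqueued after the barrier blocks, so it sees the final value.
queue.sync {
    print(total)
}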
If there’s not enough work on each thread (e.g. `maxX` isn’t very large or, as in this case, the algorithm is so fast), the overhead of the parallelized routine can start to offset the benefits of getting multiple cores involved in the calculation. So you can “stride” through multiple rows of `y` in each iteration. For example:
var total = 0
let lock = NSLock()
let stride = maxY / 20
let iterations = Int((Double(maxY) / Double(stride)).rounded(.up))

DispatchQueue.concurrentPerform(iterations: iterations) { i in
    var subTotal = 0
    let range = i * stride ..< min(maxY, (i + 1) * stride)
    for y in range {
        for x in 0 ..< maxX {
            if ... {
                subTotal += 1
            }
        }
    }
    lock.sync { total += subTotal }
}
print(total)
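Two small notes on the above: the divisor (20 here) is just a tuning knob, i.e., enough iterations to keep all the cores busy, but few enough that each iteration has a meaningful amount of work to amortize the dispatching overhead. And if `maxY` could ever be smaller than the divisor, the stride would work out to zero, so you might clamp it:

let stride = max(1, maxY / 20)   // avoid a zero stride for very small images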