
I have a method that performs a series of calculations which take quite a bit of time to complete. The objects this method computes on are generated at runtime and can range from a few to a few thousand. Obviously it would be better if I could run these computations across several threads concurrently, but when I try that, my program uses more CPU yet takes longer than running them one by one. Any ideas why?

let itemsPerThread = (dataArray.count / 4) + 1

for var i = 0; i < dataArray.count; i += itemsPerThread
{

    let name = "ComputationQueue\(i)".bridgeToObjectiveC().cString()
    let compQueue = dispatch_queue_create(name, DISPATCH_QUEUE_CONCURRENT)
    dispatch_async(compQueue,
    {
        let itemCount = i + itemsPerThread < dataArray.count ? itemsPerThread : dataArray.count - i - 1

        let subArray = dataArray.bridgeToObjectiveC().subarrayWithRange(NSMakeRange(i, itemCount)) as MyItem[]
        self.reallyLongComputation(subArray, increment: increment, outputIndex: self.runningThreads-1)
    })

    NSThread.sleepForTimeInterval(1)
}

Alternatively: if I run the same thing with a single dispatch_async call on the whole dataArray rather than the subarrays, it completes much faster while using less CPU.

Garrett
  • Why would running the computations across more threads use less CPU?? – Hot Licks Jul 11 '14 at 21:23
  • @HotLicks Because for some reason it is significantly slower than the single-threaded and less CPU intensive alternative... – Garrett Jul 11 '14 at 21:24
  • 1
    Multithreading is not a panacea -- having a program slow down after adding multithreading is pretty common, and it may take some additional reorganization to actually see a speedup. Some reasons why are listed here: http://stackoverflow.com/questions/612860/what-can-make-a-program-run-slower-when-using-more-threads … also http://unix.stackexchange.com/questions/80424/why-using-more-threads-makes-it-slower-than-using-less-threads – Jeremy Friesner Jul 11 '14 at 21:35
  • Why would you expect otherwise?? – Hot Licks Jul 11 '14 at 22:11
  • @HotLicks I would expect that both multi-threading and higher CPU usage would mean that it was going faster – Garrett Jul 11 '14 at 22:35
  • It takes more CPU just to manage multiple threads. And with multiple threads you lose a lot of CPU to maintaining cache coherency. The total CPU used will *always* be higher. *If* you can make use of multiple processors (not always possible, even in a MP box) then *sometimes* you can get an overall speed improvement by splitting the work among threads. But generally this is only worthwhile if there is I/O or network activity in the individual pieces of work, so that can be interleaved. – Hot Licks Jul 11 '14 at 22:39
  • @Garrett Your expectation is fine; your code design isn't. I made a simple example (see my answer) which reflects your idea (distribute the job in chunks) – user3441734 Nov 12 '15 at 12:26
  • FYI, `dispatch_apply` is a GCD mechanism for performing a series of concurrent tasks. See [Performing Loop Iterations Concurrently](https://developer.apple.com/library/ios/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html#//apple_ref/doc/uid/TP40008091-CH102-SW23). And if you want to do a lot on each thread, consider striding, discussed in [Improving On Loop Code](https://developer.apple.com/library/ios/documentation/General/Conceptual/ConcurrencyProgrammingGuide/ThreadMigration/ThreadMigration.html#//apple_ref/doc/uid/TP40008091-CH105-SW2). – Rob Dec 10 '15 at 07:34
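The `dispatch_apply` + striding approach mentioned in the last comment might be sketched roughly like this (Swift 2-era GCD; `compute` is a hypothetical stand-in for the real per-item work):

```swift
import Foundation

// Hypothetical stand-in for the expensive per-item work.
func compute(item: Int) -> Int {
    return item * item
}

let dataArray = Array(0..<1000)
var results = [Int](count: dataArray.count, repeatedValue: 0)

let stride = 250                                   // items per chunk
let chunkCount = (dataArray.count + stride - 1) / stride
let queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)

// dispatch_apply runs the chunks concurrently on the global queue
// and blocks until every iteration has finished.
dispatch_apply(chunkCount, queue) { chunk in
    let start = chunk * stride
    let end = min(start + stride, dataArray.count)
    // Each chunk writes a disjoint index range, so no two
    // iterations touch the same element of results.
    for i in start..<end {
        results[i] = compute(dataArray[i])
    }
}

print(results[10])   // prints 100
```

The striding matters: dispatching one block per item would drown the real work in dispatch overhead, while a handful of chunks keeps all cores busy.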

2 Answers


What you (I'm guessing) want to do should look something like this:

//
//  main.swift
//  test
//
//  Created by user3441734 on 12/11/15.
//  Copyright © 2015 user3441734. All rights reserved.
//

import Foundation

let computationGroup = dispatch_group_create()

var arr: Array<Int> = []
for i in 0..<48 {
    arr.append(i)
}
print("arr \(arr)")

func job(inout arr: Array<Int>, workers: Int) {
    let count = arr.count
    let chunk = count / workers
    guard chunk * workers == count else {
        print("arr.count must be divisible by workers!")
        return
    }
    let compQueue = dispatch_queue_create("test", DISPATCH_QUEUE_CONCURRENT)
    let syncQueue = dispatch_queue_create("aupdate", DISPATCH_QUEUE_SERIAL)

    for var i = 0; i < count; i += chunk
    {
        let j = i
        var tarr = arr[j..<j+chunk]
        dispatch_group_enter(computationGroup)
        dispatch_async(compQueue) {  () -> Void in
            for k in j..<j+chunk {
                // long time computation
                var z = 100000000
                repeat {
                    z--
                } while z > 0
                // update with chunk
                tarr[k] = j
            }
            dispatch_async(syncQueue, { () -> Void in
                for k in j..<j+chunk {
                    arr[k] = tarr[k]
                }
                dispatch_group_leave(computationGroup)
            })

        }
    }
    dispatch_group_wait(computationGroup, DISPATCH_TIME_FOREVER)
}
var stamp: Double {
    return NSDate.timeIntervalSinceReferenceDate()
}

print("running on dual core ...\n")
var start = stamp
job(&arr, workers: 1)
print("job done by 1 worker in \(stamp-start) seconds")
print("arr \(arr)\n")

start = stamp
job(&arr, workers: 2)
print("job done by 2 workers in \(stamp-start) seconds")
print("arr \(arr)\n")

start = stamp
job(&arr, workers: 4)
print("job done by 4 workers in \(stamp-start) seconds")
print("arr \(arr)\n")

start = stamp
job(&arr, workers: 6)
print("job done by 6 workers in \(stamp-start) seconds")
print("arr \(arr)\n")

with results

arr [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]
running on dual core ...

job done by 1 worker in 5.16312199831009 seconds
arr [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

job done by 2 workers in 2.49235796928406 seconds
arr [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24]

job done by 4 workers in 3.18479603528976 seconds
arr [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36]

job done by 6 workers in 2.51704299449921 seconds
arr [0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24, 24, 24, 32, 32, 32, 32, 32, 32, 32, 32, 40, 40, 40, 40, 40, 40, 40, 40]

Program ended with exit code: 0

You can use this pattern to distribute a job between any number of workers (the number of workers that gives the best performance depends on the worker definition and on the resources available in your environment). Generally, for any kind of long-running calculation (transformation), you can expect some performance gain: in a two-core environment, up to 50%. If your worker uses highly optimized functions that already use multiple cores 'by default', the performance gain can be close to nothing :-)

// generic implementation
// 1) the job distributes data between workers as fairly as possible
// 2) workers do their tasks in parallel
// 3) the order of the resulting array reflects the input array
// 4) the worker block is not required to return
//    the same type as its input elements

func job<T,U>(arr: [T], workers: Int, worker: T->U)->[U] {

    guard workers > 0 else { return [U]() }

    var res: Dictionary<Int,[U]> = [:]

    let workersQueue = dispatch_queue_create("workers", DISPATCH_QUEUE_CONCURRENT)
    let syncQueue = dispatch_queue_create("sync", DISPATCH_QUEUE_SERIAL)
    let group = dispatch_group_create()

    var j = min(workers, arr.count)
    var i = (0, 0, arr.count)
    var chunk: ArraySlice<T> = []

    repeat {
        let a = (i.1, i.1 + i.2 / j, i.2 - i.2 / j)
        i = a
        chunk = arr[i.0..<i.1]
        dispatch_group_async(group, workersQueue) { [i, chunk] in
            let arrs = chunk.map{ worker($0) }
            dispatch_sync(syncQueue) {[i,arrs] in
                res[i.0] = arrs
            }
        }
        j--
    } while j != 0

    dispatch_group_wait(group, DISPATCH_TIME_FOREVER)
    let idx = res.keys.sort()
    var results = [U]()
    idx.forEach { (idx) -> () in
        results.appendContentsOf(res[idx]!)
    }
    return results
}
user3441734

You need to

  1. Get rid of the 1 second sleep. This is artificially reducing the degree to which you get parallel execution, because you're waiting before starting the next thread. You are starting 4 threads, and you are therefore artificially delaying the start (and potentially the finish) of the final thread by 3 seconds.
  2. Use a single concurrent queue, not one per dispatch block. A concurrent queue will start blocks in the order in which they are dispatched, but does not wait for one block to finish before starting the next one - i.e. it will run blocks in parallel.
  3. NSArray is a thread-safe class. I presume that it uses a multiple-reader/single-writer lock internally, which means there is probably no advantage to be gained from creating a set of subarrays. You are, however, incurring the overhead of creating them.
  4. Multiple threads running on different cores cannot talk to the same cache line at the same time. Typical cache line size is 64 bytes, which seems unlikely to cause a problem here.
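Points 1 and 2 might be sketched like this (Swift 2-era GCD; `computeChunk` is a hypothetical stand-in for the real work): a single concurrent queue shared by all blocks, no sleep between dispatches, and a dispatch group to wait for completion:

```swift
import Foundation

// Hypothetical stand-in for the expensive per-chunk work.
func computeChunk(items: ArraySlice<Int>) -> Int {
    return items.reduce(0, combine: +)
}

let dataArray = Array(0..<100)
let itemsPerThread = (dataArray.count / 4) + 1

// One concurrent queue for all blocks, instead of a new queue
// per dispatch, plus a serial queue to guard the shared result.
let compQueue = dispatch_queue_create("ComputationQueue", DISPATCH_QUEUE_CONCURRENT)
let syncQueue = dispatch_queue_create("ResultQueue", DISPATCH_QUEUE_SERIAL)
let group = dispatch_group_create()

var total = 0
var start = 0
while start < dataArray.count {
    let end = min(start + itemsPerThread, dataArray.count)
    let range = start..<end
    // No sleep here: the concurrent queue starts blocks without
    // waiting for earlier ones to finish.
    dispatch_group_async(group, compQueue) {
        let partial = computeChunk(dataArray[range])
        // Serialize updates to the shared accumulator.
        dispatch_sync(syncQueue) { total += partial }
    }
    start = end
}

dispatch_group_wait(group, DISPATCH_TIME_FOREVER)
print(total)   // prints 4950 (the sum 0 + 1 + … + 99)
```

The group replaces both the sleep and any guesswork about when the last chunk finishes; the serial queue replaces NSArray subarrays as the synchronization point.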
Airsource Ltd
  • reallyLongComputation generally takes 4-9 seconds per item in the subArray, and the subArray can contain anywhere from 1 to 1000 items. The waiting for 1 second was to fix another bug (which I forget right now). As for the multiple queues, it was my understanding that concurrent queues run in parallel with *each other*, not with computations in a single queue. – Garrett Jul 17 '14 at 13:32
  • Nope - a dispatch queue will start blocks in sequence, but only a serial queue will wait for one block to finish before starting the next one. A concurrent queue is at liberty to use as many threads as it wants, and there is no guarantee that blocks finish in the order in which they were started. All queues can run in parallel with each other, including serial queues. – Airsource Ltd Jul 17 '14 at 19:35
  • @AirsourceLtd The design of Garrett's code should reflect the resources that are available. All your points above are right (I don't know too much about NSArray). The last sentence has no effect on Garrett's design. Garrett's design is wrong, not the basic idea behind it; he can reduce the computation time with a proper design if he has at least two cores available – user3441734 Nov 12 '15 at 12:13