7

So I have a concurrent quicksort implementation that I wrote myself. It looks like this:

func Partition(A []int, p int, r int) int {
    // Pick a median-of-3 pivot and move it to the end. MedianOf3 and
    // swapArray are small helpers defined elsewhere in my code.
    index := MedianOf3(A, p, r)
    swapArray(A, index, r)
    x := A[r]
    j := p - 1
    // Lomuto partition: everything <= x ends up left of the returned index.
    for i := p; i < r; i++ {
        if A[i] <= x {
            j++
            A[j], A[i] = A[i], A[j]
        }
    }
    swapArray(A, j+1, r)
    return j + 1
}


func ConcurrentQuicksort(A []int, p int, r int) {
    var wg sync.WaitGroup
    if p < r {
        q := Partition(A, p, r)
        // sem is a buffered channel used as a counting semaphore: if the
        // send succeeds we can start another goroutine, otherwise we fall
        // back to the sequential Quicksort for that half.
        select {
        case sem <- true:
            wg.Add(1)
            go func() {
                ConcurrentQuicksort(A, p, q-1)
                <-sem
                wg.Done()
            }()
        default:
            Quicksort(A, p, q-1)
        }
        select {
        case sem <- true:
            wg.Add(1)
            go func() {
                ConcurrentQuicksort(A, q+1, r)
                <-sem
                wg.Done()
            }()
        default:
            Quicksort(A, q+1, r)
        }
    }
    wg.Wait()
}

// Quicksort is the plain sequential version used as the fallback.
func Quicksort(A []int, p int, r int) {
    if p < r {
        q := Partition(A, p, r)
        Quicksort(A, p, q-1)
        Quicksort(A, q+1, r)
    }
}

I have a sem buffered channel, which I use to limit the number of goroutines running (if it reaches that limit, I don't start another goroutine, I just run the normal quicksort on the subarray). First I started with 100, then I changed it to 50, then 20, and the benchmarks got slightly better each time. But after switching to 10, the times started to get bigger again. So there is some arbitrary number, at least for my hardware, that makes the algorithm run most efficiently.
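
The semaphore itself is just a buffered channel, declared along these lines (the capacity is the limit I keep tuning by hand):

var sem = make(chan bool, 20)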

When I was implementing this, I actually saw an SO question about the best number of goroutines, and now I can't find it (apparently Chrome history doesn't save every visited site). Do you know how to calculate such a thing? Ideally I wouldn't have to hardcode it at all; the program would work it out itself.

P.S. I have a non-concurrent Quicksort, which runs about 1.7x slower than this. As you can see in my code, I fall back to Quicksort when the number of running goroutines exceeds the limit I set earlier. I wondered about calling ConcurrentQuicksort there instead, but without the go keyword, just a plain call: maybe once other goroutines finish their work, that ConcurrentQuicksort call would start launching goroutines of its own, speeding up the process (because, as you can see, Quicksort only makes recursive calls, without goroutines). I tried that, and the time was actually about 10% slower than the regular Quicksort. Do you know why that would happen?

minecraftplayer1234
  • I've played around with concurrent sorting (https://github.com/twotwotwo/sorts). I don't think there's a clean answer from first principles here; you just observe how much the switching cost is and size your work units so you don't spend too large a % of resources switching tasks. – twotwotwo Jun 27 '17 at 01:39
  • Okay, thank you. Do you have any idea about the thing I put in the P.S. section? – minecraftplayer1234 Jun 27 '17 at 01:42
  • This might be a possible duplicate of https://stackoverflow.com/q/8509152/4639336 – reticentroot Jun 27 '17 at 01:53
  • @reticentroot since the question is about what influences the efficiency of a parallel quicksort rather than limits of the Go runtime, it's arguably distinct – twotwotwo Jun 27 '17 at 01:57
  • Hence "might" rather than an outright flag from me, though it does show some interesting code for benchmarking goroutines... Also, how are you benchmarking? Hopefully with benchmark functions run via go test -bench; if not, see this article: https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go. That way you can quickly test and find the "sweet" spot. – reticentroot Jun 27 '17 at 02:02

2 Answers

10

You have to experiment a bit with this stuff, but I don't think the main concern is the number of goroutines running at once. As the answer @reticentroot linked to says, it's not necessarily a problem to run a lot of simultaneous goroutines.

I think your main concern should be total number of goroutine launches. The current implementation could theoretically start a goroutine to sort just a few items, and that goroutine would spend a lot more time on startup/coordination than actual sorting.

The ideal is you only start as many goroutines as you need to get good utilization of all your CPUs. If your work items are ~equal size and your cores are ~equally busy, then starting one task per core is perfect.

Here, tasks aren't evenly sized, so you might split the sort into somewhat more tasks than you have CPUs and distribute them. (In production you would typically use a worker pool to distribute work without starting a new goroutine for every task, but I think we can get away with skipping that here.)
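
If you did want a pool, a rough sketch might look like this (the task type and startWorkers are hypothetical names, not part of your code):

type task struct {
    a    []int
    p, r int
}

// startWorkers launches n goroutines that each pull tasks off the channel
// and sort them sequentially; close(tasks) once all work has been queued.
func startWorkers(n int, tasks <-chan task, wg *sync.WaitGroup) {
    wg.Add(n)
    for i := 0; i < n; i++ {
        go func() {
            defer wg.Done()
            for t := range tasks {
                Quicksort(t.a, t.p, t.r)
            }
        }()
    }
}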

To get a workable number of tasks--enough to keep all cores busy, but not so many that you create lots of overhead--you can set a minimum size (initial array size/100 or whatever), and only split off sorts of arrays larger than that.


In slightly more detail, there is a bit of cost every time you send a task off to the background. For starters:

  • Each goroutine launch spends a little time setting up the stack and doing scheduler bookkeeping
  • Each task switch spends some time in the scheduler and may incur cache misses when the two goroutines are looking at different code or data
  • Your own coordination code (channel sends and sync ops) takes time

Other things can prevent ideal speedups from happening: you could hit a systemwide limit on e.g. memory bandwidth as Volker pointed out, some sync costs can increase as you add cores, and you can run into various trickier issues sometimes. But the setup, switching, and coordination costs are a good place to start.

The benefit that can outweigh the coordination costs is, of course, other CPUs getting work done when they'd otherwise sit idle.

I think, but haven't tested, that your problems at 50 goroutines are 1) you already reached nearly-full utilization long ago, so adding more tasks adds more coordination work without making things go faster, and 2) you're creating goroutines for tiny sorts, which may spend more of their time setting up and coordinating than they actually do sorting. And at 10 goroutines your problem might be that you're no longer achieving full CPU utilization.

If you wanted, you could test those theories by counting the number of total goroutine launches at various goroutine limits (in an atomic global counter) and measuring CPU utilization at various limits (e.g. by running your program under the Linux/UNIX time utility).
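
For the first part, an atomic counter is only a couple of lines (launches is a hypothetical name):

// Hypothetical global; bump it immediately before every `go func()` launch.
var launches int64

// ...at each launch site:
atomic.AddInt64(&launches, 1)

// ...after the top-level sort finishes:
fmt.Println("goroutine launches:", atomic.LoadInt64(&launches))

For the second part, running the whole program under time and comparing user CPU time to wall-clock time gives a rough utilization figure.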

The approach I'd suggest for a divide-and-conquer problem like this is to only fork off a goroutine for large enough subproblems (for quicksort, that means large enough subarrays). You can try different limits: maybe you only start goroutines for pieces that are more than 1/64th of the original array, or pieces above some static threshold like 1000 items.
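
A sketch of that cutoff, reusing your Partition and Quicksort (the function name quicksortCutoff and the threshold value are just placeholders to tune):

// minParallel is the smallest subarray worth handing off to a new goroutine;
// it could also be computed once as len(A)/64 at the top-level call.
const minParallel = 1000

func quicksortCutoff(A []int, p int, r int) {
    if p >= r {
        return
    }
    var wg sync.WaitGroup
    q := Partition(A, p, r)
    if q-p >= minParallel {
        wg.Add(1)
        go func() {
            defer wg.Done()
            quicksortCutoff(A, p, q-1)
        }()
    } else {
        Quicksort(A, p, q-1)
    }
    if r-q >= minParallel {
        wg.Add(1)
        go func() {
            defer wg.Done()
            quicksortCutoff(A, q+1, r)
        }()
    } else {
        Quicksort(A, q+1, r)
    }
    wg.Wait()
}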


And you meant this sort routine as an exercise, I suspect, but there are various things you can do to make your sorts faster or more robust against weird inputs. The standard library sort falls back to insertion sort for small subarrays and uses heapsort for the unusual data patterns that cause quicksort problems.
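
The insertion-sort fallback is only a few lines if you want to try it (a sketch, not the standard library's actual code):

// insertionSort sorts A[p..r] in place; it is cheap for small ranges.
func insertionSort(A []int, p int, r int) {
    for i := p + 1; i <= r; i++ {
        for j := i; j > p && A[j] < A[j-1]; j-- {
            A[j], A[j-1] = A[j-1], A[j]
        }
    }
}

// In Quicksort, handle small ranges directly before partitioning, e.g.:
//   if r-p < 12 { insertionSort(A, p, r); return }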

You can also look at other algorithms like radix sort for all or part of the sorting, which I played with. That sorting library is also parallel. I wound up using a minimum cutoff of 127 items before I'd hand a subarray off for other goroutines to sort, and I used an arrangement with a fixed pool of goroutines and a buffered chan to pass tasks between them. That produced decent practical speedups at the time, though it was likely not the best approach at the time and I'm almost sure it's not on today's Go scheduler. Experimentation is fun!

twotwotwo
    Well said! Maybe it is worth stating that launching more goroutines than there are cores to actually execute them in parallel probably will not lead to any speedup (as you explained, coordination overhead increases while actual sorting doesn't). One more point maybe worth mentioning: the goroutines are competing for memory access. Depending on what you sort (ints are small, other data structures might be larger or not even contiguous in memory), this will have a dramatic effect if too many goroutines are launched. – Volker Jun 27 '17 at 06:18
    I worked in the memory bandwidth thing, and the point that you just can't count on a linear speedup generally. I revised the NumCPU wording, but it's very hard to find exactly the right words: obviously >NumCPU tasks doesn't mean more parallelism, but splitting a sort into, say, 24 tasks on 8 cores gives the scheduler a way to distribute work more evenly across cores in spite of the unevenly-distributed sizes of individual quicksort subtasks (since partitioning doesn't give you perfect halves). A worker pool would be better (and I mentioned that), but you can tolerate slightly too many goroutines for the sake of balancing. – twotwotwo Jun 27 '17 at 08:32
  • You are right. But I doubt the OP has access to a 2048-CPU NUMA machine :-) – Volker Jun 27 '17 at 09:24
  • Take a look at this: https://play.golang.org/p/lf-0lzUSKi No matter how I limit the number of running goroutines, `ConcSort` is always slower than `NormalSort`. Isn't that odd? – minecraftplayer1234 Jun 27 '17 at 14:39
  • You want `go func(i int) { sort.Ints(AA[i]); wg.Done(); <-sem }(i)` -- make `i` an explicit param of your closure to avoid [this issue with var capture](https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables), and add `<-sem` because it looks like you want to eventually receive those `true` values you send. I think that would help if you're running on the latest Go and the subarrays are big (try, like, 100K each). But again, concurrent goroutines isn't the number you want to focus on; it's whether you're backgrounding large enough pieces of work to be worth it. – twotwotwo Jun 27 '17 at 18:04
0

If the operation is CPU bounded, my experiments show that the optimal is the number of CPUs.
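
In Go you can read that number at runtime instead of hardcoding it, for example by sizing the semaphore from runtime.NumCPU:

// Size the limit from the machine instead of a hand-tuned constant.
var sem = make(chan bool, runtime.NumCPU())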