
I'm comparing the performance of two Haskell programs running the same computation.

The first one is sequential:

main :: IO()
main = putStr $ unlines . map (show . solve) $ [100..107]
  where solve x = pow x (10^7) (982451653)

The second one uses Control.Parallel.Strategies:

import Control.Parallel.Strategies

main :: IO()
main = putStr $ unlines . parMap rdeepseq (show . solve) $ [100..107]
  where solve x = pow x (10^7) (982451653)

In both cases, pow is the modular exponentiation naively implemented as:

pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * (pow a (b-1) m) `mod` m

The sequential program runs in about 3 seconds using, as expected, 100% CPU.

$ stack ghc seq.hs -- -O2
$ \time -f "%e s - %P" ./seq > /dev/null
2.96 s - 100%

The parallel program also runs in about 3 seconds using 100% CPU when limited to a single core.

$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.14 s - 99%

But when I ran it on 4 cores, I did not observe the performance gain I was expecting:

$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
3.31 s - 235%

Even more surprising, the sequential program uses more than 100% CPU when run on several cores:

$ stack ghc seq.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./seq +RTS -N4 > /dev/null
3.26 s - 232%

How can those results be explained?


EDIT - As advised by @RobertK and @Yuras, I replaced `rdeepseq` with `rpar` and it did fix the initial issue. However, the performance is still much less than what I expected:

$ stack ghc par.hs -- -O2 -threaded
$ \time -f "%e s - %P" ./par +RTS -N1 > /dev/null
3.12 s - 99%
$ \time -f "%e s - %P" ./par +RTS -N4 > /dev/null
1.91 s - 368%

The execution time is barely halved, even though the 4 cores are running more than 90% of the time on average.

Also, some parts of the threadscope graph look very sequential: [threadscope screenshot]

Vincent

2 Answers


First of all, rdeepseq seems to be buggy. Try running ./par +RTS -N4 -s, and you'll see no sparks created. That is why you don't see any speedup on 4 cores. Use rnf x `pseq` return x instead.
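For illustration, here is a minimal sketch of what that could look like (the name forceNF is mine, and pow is the naive version from the question):

import Control.DeepSeq (NFData, rnf)
import Control.Parallel (pseq)
import Control.Parallel.Strategies

-- forceNF is a hypothetical name: a strategy that fully evaluates
-- its argument to normal form before returning it.
forceNF :: NFData a => Strategy a
forceNF x = rnf x `pseq` return x

main :: IO ()
main = putStr $ unlines $ parMap forceNF (show . solve) [100..107]
  where solve x = pow x (10^7) 982451653

-- naive pow from the question
pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * pow a (b-1) m `mod` m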

Also note the GC statistics in the +RTS -s output. Actually, GC takes most of the CPU. With -N4 you have 4 parallel GC threads running, and they take more time. That is why the sequential program takes much more CPU on 4 cores: basically, you have 3 GC threads idle in a spin lock waiting for synchronization. They do nothing useful, but eat CPU in a busy loop. Try limiting the number of parallel GC threads with the -qn1 option.
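For example, something like this (passing these flags may require building with -rtsopts; see the other answer):

$ stack ghc par.hs -- -O2 -threaded -rtsopts
$ ./par +RTS -N4 -qn1 -s > /dev/null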

Regarding the performance gain: you should not expect perfect scaling. Also, I think you have one fizzled spark, that is, a spark that was evaluated in parallel but whose result was not used.

Added: Comparing with the Python implementation you linked in the comments, I see that you are using a completely different algorithm in Haskell. A more or less similar approach is the following (it requires the BangPatterns extension):

{-# LANGUAGE BangPatterns #-}

-- The accumulator is reduced modulo m at every step and forced with a
-- bang pattern, so no chain of thunks builds up.
pow :: Int -> Int -> Int -> Int
pow a b m = go 1 b
  where
    go !r 0  = r
    go !r b' = go ((r * a) `mod` m) (pred b')

Your original algorithm uses the stack to build the result, so it is bound by GC, not by the actual computation. That is why you don't see a big speedup. With the new one I see a 3x speedup (I had to increase the amount of work to see the speedup, because otherwise the new algorithm finishes too quickly).
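Putting it together, a complete parallel version built around the strict pow might look like this sketch (same workload as in the question; swap in the pseq-based strategy above if your version of the parallel library is affected by the rparWith issue mentioned in the comments):

{-# LANGUAGE BangPatterns #-}

import Control.Parallel.Strategies

-- Strict modular exponentiation from above.
pow :: Int -> Int -> Int -> Int
pow a b m = go 1 b
  where
    go !r 0  = r
    go !r b' = go ((r * a) `mod` m) (pred b')

main :: IO ()
main = putStr $ unlines $ parMap rdeepseq (show . solve) [100..107]
  where solve x = pow x (10^7) 982451653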

Yuras
  • Thanks, replacing `rdeepseq` with `rpar` fixed it. However, the performance gain is still much less than I expected: 3.1 s with `-N1` and 1.9 s with `-N4`. About the sequential program, I don't see why the GC would use an extra 132% when there is nothing to collect. – Vincent Jul 17 '18 at 09:38
  • I believe the GC threads are asleep until GC is initiated; they are not stuck in a busy loop. – Robert Jul 17 '18 at 11:10
  • Thanks for the edit. I actually chose to run a small number of long computations so the program could achieve close to perfect scaling. For instance, [this python code](https://gist.github.com/vxgmichel/75467f62673ad34a1567e5d66c6c3925) has a scaling factor of 3.6 over 4 cores. – Vincent Jul 17 '18 at 12:15
  • @RobertK they are in a busy loop while GC is in progress, but there is not enough work for all GC threads. See here: https://github.com/ghc/ghc/blob/a122d4fdd0a5858e44f9d3be90a258903e0288b2/rts/sm/GC.c#L1103 – Yuras Jul 17 '18 at 13:03
  • Oh, it seems I was mistaken, thanks for the reference. @Yuras – Robert Jul 17 '18 at 13:31
  • Thanks for your second edit, I got the 3x speedup you were talking about. Weirdly enough, [2 of the 4 cores stay idle](https://i.stack.imgur.com/c56aw.png) during the first second. – Vincent Jul 17 '18 at 15:54
  • `rdeepseq` is not buggy. That issue was actually caused by a buggy implementation of `rparWith` that allowed the optimizer to mess it up. It's fixed in the master branch. I don't know if there's been a release. – dfeuer Jul 18 '18 at 01:25

I do not believe your parallel example is actually parallel. parMap accepts a strategy, and your strategy simply tells it to perform a deepseq. You need to combine this strategy with one that defines the parallel behaviour, e.g. `rpar`. You are telling Haskell 'perform this map, using this strategy', and right now your strategy does not define any parallel behaviour.
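A sketch of that combination, using the `dot` combinator from Control.Parallel.Strategies (pow is the naive one from the question):

import Control.Parallel.Strategies

-- Spark each element in parallel, and evaluate it to normal form.
main :: IO ()
main = putStr $ unlines $ parMap (rpar `dot` rdeepseq) (show . solve) [100..107]
  where solve x = pow x (10^7) 982451653

pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * pow a (b-1) m `mod` m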

Also make sure that you compile your program with the -rtsopts flag (I do not know if stack does this for you, but GHC requires it to enable runtime options).
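With stack, that would be something like:

$ stack ghc par.hs -- -O2 -threaded -rtsopts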

Robert
  • Thanks, replacing `rdeepseq` with `rpar` fixed it. However, the performance gain is still much less than I expected: 3.1 s with `-N1` and 1.9 s with `-N4`. I'll edit my question to include relevant information. – Vincent Jul 17 '18 at 09:39
  • Haskell is a lazy language. You are telling it 'evaluate this in parallel', and it only evaluates it as much as it needs to. It might return an expression which is not fully evaluated, leaving work for the 'main' thread: https://stackoverflow.com/questions/6872898/haskell-what-is-weak-head-normal-form You need to combine the `rpar` strategy with the `rdeepseq` one, such that Haskell interprets it as 'in parallel, evaluate this to normal form'. Consider this function when combining strategies: http://hackage.haskell.org/package/parallel-3.2.2.0/docs/Control-Parallel-Strategies.html#v:dot – Robert Jul 17 '18 at 10:46
  • Simon Marlow has an excellent book on parallel and concurrent functional programming: https://web.archive.org/web/20171207155221/http://chimera.labs.oreilly.com:80/books/1230000000929/index.html It explains all these behaviours very nicely. Once you get this working, you could look into granularity control, making sure the parallel tasks are big enough. If a task is small, the work of creating the spark might dominate the work itself. @Vincent – Robert Jul 17 '18 at 10:52
  • Thanks for all this information! I used ``rpar `dot` rdeepseq`` as a strategy and added `force` (from `Control.DeepSeq`) before and after `parMap` to make sure Haskell knows all those values need to be computed. However, I did not notice any improvement and the [threadscope graph](https://i.stack.imgur.com/2elsb.png) looks very similar. – Vincent Jul 17 '18 at 11:32
  • I would suggest looking into granularity control, moving forward. If your list contains 10 000 elements and the computations are fairly trivial, you do not want to create 10 000 sparks. Perhaps you wish to spawn 1 000 sparks which each compute 10 sequential mappings? This requires some trial and error to find a nice granularity; see the parListChunk sketch after this thread. Also, you can try turning off the garbage collector (by allocating a large amount of space for it) while trying out your parallel strategies. @Vincent – Robert Jul 17 '18 at 11:42
  • Thanks again, I'll look into that. In Python, the granularity is typically controlled through the `chunksize` argument of [Pool.map](https://docs.python.org/3.7/library/multiprocessing.html#multiprocessing.pool.Pool.map); I'm surprised that `Control.Parallel.Strategies` doesn't provide the same kind of feature. – Vincent Jul 17 '18 at 15:58
  • @Vincent passing an integer defining the chunk size versus writing a wrapper for your function which controls the exact same thing makes little difference in difficulty. You can parallelise much more than just maps, and it makes no sense to define chunk sizes only for maps. It is up to you. – Robert Jul 18 '18 at 05:10
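For what it's worth, here is a sketch of the chunking idea discussed in this thread, using the parListChunk combinator from Control.Parallel.Strategies. The chunk size of 100 and the 10 000-element workload are illustrative values to be tuned by experiment:

import Control.Parallel.Strategies

-- One spark per chunk of 100 elements instead of one spark per element.
solveAll :: [Int] -> [Int]
solveAll xs = map solve xs `using` parListChunk 100 rdeepseq
  where solve x = pow x (10^7) 982451653

-- naive pow from the question
pow :: Int -> Int -> Int -> Int
pow a 0 m = 1
pow a b m = a * pow a (b-1) m `mod` m

main :: IO ()
main = putStr $ unlines $ map show (solveAll [1..10000])
-- note: this workload is large; shrink the exponent to experiment quickly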