
I have an implementation of Conway's Game of Life. I want to speed it up if possible by using parallelism.

import Control.Arrow ((&&&))
import Control.Parallel.Strategies (parMap, rseq)
import Data.List (group, sort)

life :: [(Int, Int)] -> [(Int, Int)]
life cells = map snd . filter rules . freq $ concatMap neighbours cells
    where rules (n, c) = n == 3 || (n == 2 && c `elem` cells)
          freq = map (length &&& head) . group . sort

parLife :: [(Int, Int)] -> [(Int, Int)]
parLife cells = parMap rseq snd . filter rules . freq . concat $ parMap rseq neighbours cells
    where rules (n, c) = n == 3 || (n == 2 && c `elem` cells)
          freq = map (length &&& head) . group . sort

neighbours :: (Int, Int) -> [(Int, Int)]
neighbours (x, y) = [(x + dx, y + dy) | dx <- [-1..1], dy <- [-1..1], dx /= 0 || dy /= 0]

In profiling, neighbours accounts for 6.3% of the time spent, so while that is small, I expected a noticeable speedup from mapping it in parallel.

I tested with a simple function

main = print $ last $ take 200 $ iterate life fPent
    where fPent = [(1, 2), (2, 2), (2, 1), (2, 3), (3, 3)]

and compiled the parallel version as

ghc --make -O2 -threaded life.hs

and ran it as

./life +RTS -N3

It turns out that the parallel version is slower. Am I using parMap incorrectly here? Is this even a case where parallelism can be used?
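One variation I considered (sketched below, untested) is sparking chunks of the list rather than one spark per cell, so that each spark carries enough work to pay for its overhead. parListChunk is from Control.Parallel.Strategies; the chunk size of 64 is an arbitrary guess on my part, and this reuses freq, rules, and neighbours from the code above.

import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)

-- Sketch: fully evaluate the neighbour lists, in parallel chunks of 64,
-- so each spark amortises its overhead over many cells.
chunkLife :: [(Int, Int)] -> [(Int, Int)]
chunkLife cells = map snd . filter rules . freq . concat
                $ withStrategy (parListChunk 64 rdeepseq) (map neighbours cells)
    where rules (n, c) = n == 3 || (n == 2 && c `elem` cells)
          freq = map (length &&& head) . group . sort

Running with +RTS -s prints spark statistics (converted vs. fizzled), which should show whether such sparks do useful work.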

cdk
  • Firstly, do you have at least 3 cores in your computer? Secondly, parallelism always comes with some overhead, so if the work being done by each thread is very small, the extra overhead will outweigh any speed-ups. – huon Sep 01 '12 at 08:24
  • I have an i5-2500k, so there are definitely up to 4 cores available – cdk Sep 01 '12 at 13:34
  • Note that you can get much larger speedups from improving the algorithm than from parallelising. The bulk of the time is spent in `sort` and `elem`. Using the fact that the list of cells is sorted (and changing `fPent` so that it is sorted) you can roughly halve the time. – Daniel Fischer Sep 01 '12 at 14:22
  • @DanielFischer: the list is not necessarily sorted even if fPent is sorted. freq takes the list of every cell neighbouring a live cell as its input, and the same cell can be the neighbour of many different live cells and so appear scattered throughout the list. If there were a way to find the total number of occurrences of each unique element in the list faster than sorting, that would indeed improve the algorithm (see the sketch after these comments) – cdk Sep 07 '12 at 23:14
  • Chris, you sort the list in the step: `freq = map (length &&& head) . group . sort`, so the `cells` for the next generation are always sorted. – Daniel Fischer Sep 08 '12 at 16:11
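A sketch of the counting idea from this exchange (illustrative, not from the original thread), assuming Data.Map.Strict and Data.Set from the containers package: a strict map tallies the neighbour occurrences in one pass, and a set replaces the linear `elem` test.

import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Sketch: count each neighbour cell with a strict map instead of
-- group . sort, and test liveness with a set instead of `elem`.
mapLife :: [(Int, Int)] -> [(Int, Int)]
mapLife cells = [ c | (c, n) <- M.toList counts
                    , n == 3 || (n == 2 && c `S.member` alive) ]
    where counts = M.fromListWith (+) [ (c, 1 :: Int) | c <- concatMap neighbours cells ]
          alive  = S.fromList cells

M.toList returns the keys in ascending order, so the output stays sorted between generations, preserving the invariant Daniel Fischer points out.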

1 Answer


I don't think you're measuring this correctly. Your parLife is indeed a bit faster than life. In fact, on my machine (Phenom X4, 4 cores), the former takes only about 92.5% of the time the latter does, which, considering you said you'd expect at most about a 6% improvement, is quite good.

What is your benchmarking setup? Have you tried using criterion? Here's what I did:

import Criterion
import Criterion.Main

-- your code, minus main

runGame f n = last $ take n $ iterate f fPent
    where fPent = [(1, 2), (2, 2), (2, 1), (2, 3), (3, 3)]

main = defaultMain
    [ bench "No parallelism 200" $ whnf (runGame life)    200
    , bench "Parallelism 200"    $ whnf (runGame parLife) 200 ]

Compiled with ghc --make -O2 -threaded -o bench and ran with ./bench -o bench.html +RTS -N3.
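One note on whnf: it only evaluates the final cell to weak head normal form, but since last has to walk the entire list, that already forces all the generations, which is what we want to measure. If you want the result fully evaluated, criterion's nf should work too, since the result is just a pair of Ints (I haven't re-run the numbers this way):

bench "No parallelism 200 (nf)" $ nf (runGame life) 200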

Here's the detailed result of the report.

Aleksandar Dimitrov
  • Hmm, strange. I also get the result that `parLife` is faster from criterion, but when I run the thing standalone, `parLife` is consistently significantly slower than `life`. – Daniel Fischer Sep 01 '12 at 14:11
  • Ah, only with the threaded runtime, not with the nonthreaded! – Daniel Fischer Sep 01 '12 at 14:15
  • I think it has something to do with the longevity of the process… I.e. initializing the thread pool etc. is more expensive than the (admittedly minor) gains we get from parallelizing. – Aleksandar Dimitrov Sep 01 '12 at 14:18
  • Possibly. But I ran with 500 iterations to get more reliable timings. That's long enough that initialising the thread pool etc. shouldn't matter. Probably the threaded runtime has higher overhead for sparking. – Daniel Fischer Sep 01 '12 at 14:27
  • Oh, but wait! The non-threaded runtime doesn't even support parallelism, no sparks there! – Daniel Fischer Sep 01 '12 at 14:38
  • Yeah. It might be worth running it through ThreadScope to see what's happening exactly. But, as you suggested further up, there are more obvious ways to speed up this code. – Aleksandar Dimitrov Sep 01 '12 at 14:39
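For anyone following up on the ThreadScope suggestion, the usual workflow (assuming the question's life.hs) is to compile with event logging, run with the -l RTS flag, and open the resulting log:

ghc --make -O2 -threaded -eventlog life.hs
./life +RTS -N3 -l
threadscope life.eventlog

-eventlog links in event-logging support, and +RTS -l makes the run write life.eventlog, which ThreadScope then visualises per core.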