Writing "fib" to run in parallel: -N2 is slower?

Question

I'm learning Haskell and trying write code to execute in parallel, but Haskell always runs it sequentially. And when I execute with the -N2 runtime flag it take more time to execute than if I omit this flag.

Here is code:

import Control.Parallel
import Control.Parallel.Strategies

fib :: Int -> Int
fib 1 = 1
fib 0 = 1
fib n = fib (n - 1) + fib (n - 2)

fib2 :: Int -> Int
fib2 n = a `par` (b `pseq` (a+b))
    where a = fib n
          b = fib n + 1

fib3 :: Int -> Int
fib3 n = runEval $ do
                a <- rpar (fib n)
                b <- rpar (fib n + 1)
                rseq a
                rseq b
                return (a+b)

main = do putStrLn (show (fib3 40))

What did I do wrong? I tried this sample in Windows 7 on Intel core i5 and in Linux on Atom.

Here is log from my console session:

ghc -rtsopts -threaded -O2 test.hs
[1 of 1] Compiling Main             ( test.hs, test.o )

test +RTS -s
331160283
          64,496 bytes allocated in the heap
           2,024 bytes copied during GC
          42,888 bytes maximum residency (1 sample(s))
          22,648 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  Parallel GC work balance: nan (0 / 0, ideal 1)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  6.59s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  2 (bound)  :    6.33s    (  6.59s)       0.00s    (  0.00s)

  SPARKS: 2 (0 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.33s  (  6.59s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    6.33s  (  6.59s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    10,191 bytes per MUT second

  Productivity 100.0% of total user, 96.0% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync_large_objects: 0
gen[1].sync_large_objects: 0


test +RTS -N2 -s 
331160283
          72,688 bytes allocated in the heap
           5,644 bytes copied during GC
          28,300 bytes maximum residency (1 sample(s))
          24,948 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     1 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     1 parallel,  0.00s,  0.01s elapsed

  Parallel GC work balance: 1.51 (937 / 621, ideal 2)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  9.29s)       0.00s    (  0.00s)
  Task  1 (worker) :    4.53s    (  9.29s)       0.00s    (  0.00s)
  Task  2 (bound)  :    5.84s    (  9.29s)       0.00s    (  0.01s)
  Task  3 (worker) :    0.00s    (  9.29s)       0.00s    (  0.00s)

  SPARKS: 2 (1 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   10.38s  (  9.29s elapsed)
  GC    time    0.00s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   10.38s  (  9.30s elapsed)

  %GC time       0.0%  (0.1% elapsed)

  Alloc rate    7,006 bytes per MUT second

  Productivity 100.0% of total user, 111.6% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync_large_objects: 0
gen[1].sync_large_objects: 0

There is a slightly diferrent `nfib` at http://www.haskell.org/ghc/docs/6.6/html/users_guide/lang-parallel.html — , Jan 26 '12 at 20:25
Cannot reproduce (Core i5, linux), real time 2.75s without `-Nk`, 1.48s with `-N2`. Which GHC version and what compiler options did you use? — Daniel Fischer, Jan 26 '12 at 20:31
I've add log from my session under Windows 7(Pentium D) and Haskell Platform 2011.4.0.0. Compiler flags are : ghc -rtsopts -threaded -O2 test.hs — KolKir, Jan 26 '12 at 20:45
Atom has 2 cores, right? AFAIK i5 can have 2 or 4 core, which one do you have? It could be "the last core" issue. — Yuras, Jan 26 '12 at 21:04
@Yuras But the i5 Cores have 2 threads each, so the last core slowdown wouldn't bite there. Also, if memory serves, it wasn't an issue on Windows. **But** if both threads get scheduled to the same physical core, `-N2` would be slower than `-N1` (hyperthreading only works well if the two threads use different parts of the CPU, if both compete for the ALU, it's slower than a single thread). I've never had such a big difference on my old P4 in such cases though. — Daniel Fischer, Jan 26 '12 at 21:16
I used GHC 7.0.4. My Core i5 have 4 physical cores, Atom one but with phyperthreading, Pentium D have 2 physical cores. — KolKir, Jan 26 '12 at 22:29
@KolKir: don't know what was happening last night, but as of today I can't reproduce either. What happens if you try running with different gc options, e.g. `+RTS -qg -qa`? — John L, Jan 27 '12 at 12:11

score 4 · Answer 1 · answered Jan 27 '12 at 14:29

I think answer is that "GHC will optimise the fib function so that it does no allocation, and computations that do no allocation cause problems for the RTS because the scheduler never gets to run and do load-balancing (which is necessary for parallelism)" as wrote Simon in this discussion group. Also I found good tutorial.

Writing "fib" to run in parallel: -N2 is slower?

1 Answers1

Linked