
I have a simple routine that takes the product of a vector of Double. I am attempting to parallelize this code, but many of the sparks end up fizzling. Here is a self-contained benchmark, which is also provided as a gist:

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE MagicHash #-}

{-# OPTIONS_GHC -O2 -Wall -threaded -fforce-recomp #-}

import Criterion.Main
import Control.Monad (when)
import Control.Parallel.Strategies (runEval,rpar,rseq)
import qualified Data.Vector.Primitive as PV

main :: IO ()
main = do
  let expected = PV.product numbers
  when (serialProduct numbers /= expected) $
    fail "serialProduct implementation incorrect"
  defaultMain
    [ bgroup "product"
      [ bench "serial" $ whnf serialProduct numbers
      , bench "parallel" $ whnf parallelProduct numbers
      ]
    ]

numbers :: PV.Vector Double
numbers = PV.replicate 10000000 1.00000001
{-# NOINLINE numbers #-}

serialProduct :: PV.Vector Double -> Double
serialProduct v =
  let !len = PV.length v
      go :: Double -> Int -> Double
      go !d !ix = if ix < len then go (d * PV.unsafeIndex v ix) (ix + 1) else d
   in go 1.0 0

-- | This only works when the vector length is a multiple of 8.
parallelProduct :: PV.Vector Double -> Double
parallelProduct v = runEval $ do
  let chunk = div (PV.length v) 8
  p2 <- rpar (serialProduct (PV.slice (chunk * 6) chunk v))
  p3 <- rpar (serialProduct (PV.slice (chunk * 7) chunk v))
  p1 <- rseq (serialProduct (PV.slice (chunk * 0) (chunk * 6) v))
  return (p1 * p2 * p3)

This can be built and run with:

ghc -threaded parallel_compute.hs
./parallel_compute +RTS -N4 -s

I have an eight-core box, so giving the runtime four capabilities should be fine. The benchmark results are not super important, but here they are:

benchmarking product/serial
time                 11.40 ms   (11.30 ms .. 11.53 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 11.43 ms   (11.37 ms .. 11.50 ms)
std dev              167.2 μs   (120.4 μs .. 210.1 μs)

benchmarking product/parallel
time                 10.03 ms   (9.949 ms .. 10.15 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 10.17 ms   (10.11 ms .. 10.31 ms)
std dev              235.7 μs   (133.4 μs .. 426.2 μs)

Now, the runtime statistics. This is where I'm confused:

   124,508,840 bytes allocated in the heap
   529,843,176 bytes copied during GC
    80,232,008 bytes maximum residency (8344 sample(s))
       901,272 bytes maximum slop
            83 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
Gen  0        19 colls,    19 par    0.008s   0.001s     0.0001s    0.0003s
Gen  1      8344 colls,  8343 par    2.916s   1.388s     0.0002s    0.0008s

Parallel GC work balance: 76.45% (serial 0%, perfect 100%)

TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)

SPARKS: 1024 (502 converted, 0 overflowed, 0 dud, 28 GC'd, 494 fizzled)

INIT    time    0.000s  (  0.002s elapsed)
MUT     time   11.480s  ( 10.414s elapsed)
GC      time    2.924s  (  1.389s elapsed)
EXIT    time    0.004s  (  0.005s elapsed)
Total   time   14.408s  ( 11.811s elapsed)

Alloc rate    10,845,717 bytes per MUT second

Productivity  79.7% of total user, 88.2% of total elapsed

In the section that deals with sparks, we can see that about half of them fizzle. This seems unbelievable to me. In parallelProduct, the main thread works on a task six times larger than what is given to either of the sparks. However, it seems like one of these sparks always fizzles (or gets GC'd). And this isn't a small job either. We're talking about a computation that takes milliseconds, so it seems implausible that the main thread could finish its share before the sparked thunks get converted to threads.

My understanding (which could be totally wrong) is that this kind of computation should be ideal for the threaded runtime. Garbage collection seems to be the biggest problem for parallel applications in GHC, but the task I'm doing here generates almost no garbage, since GHC turns the innards of serialProduct into a tight loop with everything unboxed.
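
You can check the tight-loop claim by dumping GHC's simplifier output (assuming the same file name as in the build command above):

ghc -O2 -fforce-recomp -ddump-simpl parallel_compute.hs

The Core for serialProduct should show a worker loop operating directly on unboxed Double# and Int# values.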

On the upside, we do see an 11% speedup for the parallel version in the benchmarks. So the eighth of the work that was successfully sparked really did make a measurable impact: offloading one eighth of an 11.4 ms job predicts roughly 7/8 × 11.4 ms ≈ 10.0 ms, which matches the parallel mean. I'm just wondering why the other spark doesn't work the way I expect it to.

Any help on understanding this would be appreciated.

EDIT

I've updated the gist to include another implementation:

-- | This only works when the vector length is a multiple of 4.
--   In addition to the imports above, this needs:
--     import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
--     import Control.Exception (evaluate)
--     import System.IO.Unsafe (unsafePerformIO)
parallelProductFork :: PV.Vector Double -> Double
parallelProductFork v = unsafePerformIO $ do
  let chunk = div (PV.length v) 4
  var <- newEmptyMVar
  _ <- forkIO $ evaluate (serialProduct (PV.slice (chunk * 0) chunk v)) >>= putMVar var
  _ <- forkIO $ evaluate (serialProduct (PV.slice (chunk * 1) chunk v)) >>= putMVar var
  _ <- forkIO $ evaluate (serialProduct (PV.slice (chunk * 2) chunk v)) >>= putMVar var
  _ <- forkIO $ evaluate (serialProduct (PV.slice (chunk * 3) chunk v)) >>= putMVar var
  a <- takeMVar var
  b <- takeMVar var
  c <- takeMVar var
  d <- takeMVar var
  return (a * b * c * d)

This one has excellent performance:

benchmarking product/parallel mvar
time                 3.814 ms   (3.669 ms .. 3.946 ms)
                     0.986 R²   (0.977 R² .. 0.992 R²)
mean                 3.818 ms   (3.708 ms .. 3.964 ms)
std dev              385.6 μs   (317.1 μs .. 439.8 μs)
variance introduced by outliers: 64% (severely inflated)

But, it falls back on conventional concurrency primitives instead of using sparks. I do not like this solution, but I am providing it as evidence that it should be possible to achieve the same performance with a spark-based approach.
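
For reference, the evenly chunked spark-based variant that I mention in the comments below looks roughly like this (a sketch; the name parallelProductEven is just for illustration, and as noted in the comments, about two thirds of its sparks fizzle):

-- | Evenly chunked spark variant. Only works when the vector
--   length is a multiple of 4.
parallelProductEven :: PV.Vector Double -> Double
parallelProductEven v = runEval $ do
  let chunk = div (PV.length v) 4
  p1 <- rpar (serialProduct (PV.slice (chunk * 1) chunk v))
  p2 <- rpar (serialProduct (PV.slice (chunk * 2) chunk v))
  p3 <- rpar (serialProduct (PV.slice (chunk * 3) chunk v))
  p0 <- rseq (serialProduct (PV.slice (chunk * 0) chunk v))
  return (p0 * p1 * p2 * p3)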

  • Nothing jumps out at me, but what happens if you add one/both of `-A200M` and `-qa` to your RTS options? – jberryman Oct 05 '17 at 14:16
  • With `./parallel_compute +RTS -N4 -s -qa -A200M`, the results are nearly identical. Which is what I would suspect, since the program doesn't really allocate much. – Andrew Thaddeus Martin Oct 05 '17 at 14:29
  • Hm, I guess those statistics are from the criterion `main` and probably not very useful (criterion calculates statistics in parallel, last time I looked, for instance). From experience I've come to distrust the way the runtime and GC interact with parallelism/concurrency under default settings. An observation: doing some quick math, your serial version is about as fast as we'd hope. It seems _possible_ that those two small chunks you're evaluating in parallel are simply suffering from bad cache behavior, but I can't make the numbers add up there – jberryman Oct 05 '17 at 14:51
  • For what it's worth, when I ran your code I only got 60/604 fizzled sparks the first time and 45/554 the second. – rampion Oct 05 '17 at 14:59
  • Possibly related? https://stackoverflow.com/q/22638421 – Will Fancher Oct 05 '17 at 17:23
  • @jberryman I've updated the question to include another implementation using `MVar`. The performance it achieves should alleviate concerns that it's a cache-related issue since the `MVar`-based approach would suffer from the same problem. – Andrew Thaddeus Martin Oct 05 '17 at 17:58
  • @WillFancher In the issue you've linked to, the problem is that the parallelism was too fine-grained. This shouldn't be an issue here since the sparks are being assigned to something that takes over a millisecond. – Andrew Thaddeus Martin Oct 05 '17 at 17:59
  • Sorry, I wasn't really clear: in your original you've sparked only two small chunks of work which will be run in parallel with the larger chunk, so you should expect a 25% speedup at most. Your MVar example isn't comparable. The remainder of the performance difference I don't know how to account for – jberryman Oct 05 '17 at 20:08
  • Sorry, I should have been more clear about the motivation of the `MVar` example. I played around with the original example a bit before I posted this question. Making the chunk size more evenly distributed doesn't help, because for some reason, a huge number of the sparks always fizzle. When I divide it evenly like I do in the `MVar` example, it's about 66% of the sparks that get fizzled, and it's the same speed as the serial version. Something I've noticed about the spark-based approach is that the order in which I multiply the subresults at the end impacts the percentage of them that fizzle. – Andrew Thaddeus Martin Oct 05 '17 at 20:56
  • Continuing my last comment, this leads me to believe that `rseq` may not actually do what I think it does. My expectation is that the main thread should block until we have completely computed the product of that slice of the vector, but I suspect that this isn't happening. – Andrew Thaddeus Martin Oct 05 '17 at 20:58
  • As I think about it more, I don't totally understand how work stealing is handled. If there is only one thread that steals work from the others, that would help explain the unsatisfactory performance I am seeing from sparks. Although, the issue related to re-ordering the final results still gives me some suspicions elsewhere. – Andrew Thaddeus Martin Oct 05 '17 at 21:26
  • I haven't worked much with pure parallelism and reviewed both the docs for the parallel-strategies lib and the intro chapter in Simon Marlow's book and came away with the same impression as you, i.e. that this should work. So I'd say that this is at least a case of poor documentation which would be good to report if you can. It does seem like something like the following happens: as soon as the `rseq` computation finishes we're allowed to return, if at that point the `rpar` sparks have not completed, they fizzle and get recomputed from the thunk you return, in `(p1 * p2 * p3)`...or something? – jberryman Oct 05 '17 at 21:47

1 Answer


The problem here is that creating a spark doesn't immediately wake up an idle capability, see here. By default the scheduling interval is 20 ms, so when you create a spark, it can take up to 20 ms for it to be turned into a real thread. By that time the calling thread will most likely have already evaluated the thunk, and the spark will be either GC'd or fizzled. In your example the main thread finishes its 6/8 chunk in roughly 6/8 × 11.4 ms ≈ 8.6 ms and then demands the sparked results, well within a single scheduling interval, which is why the sparks so often fizzle.

By contrast, forkIO will immediately wake up an idle capability, if there is one. That is why explicit concurrency is more reliable than parallel strategies.

You can work around the issue by decreasing the scheduling interval using the -C option (docs). E.g. +RTS -C0.01 seems to be enough.
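
For example, using the invocation from the question:

./parallel_compute +RTS -N4 -s -C0.01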

Yuras
  • This solves it. Although tweaking the scheduling interval like this seems like it might negatively affect other things. So I suppose that in the normal case, if you're going to be sparking things, you should ensure that the work done by all the sparks plus the main thread takes well over 20ms. Otherwise, nearly everything will fizzle unless scheduling happens to be coming soon. I've always wondered about the threshold for how fine-grained sparks should be, and my understanding is now that this is roughly it. Also, where did you find documentation of `-C`? I cannot find it. – Andrew Thaddeus Martin Oct 06 '17 at 12:34
  • @AndrewThaddeusMartin I added a link to the docs. Re the threshold - yes, that is my impression too. Though I never used parallel strategies in practice. – Yuras Oct 06 '17 at 12:41
  • @AndrewThaddeusMartin Probably it makes sense to file a bug to see what ghc devs think about spark scheduling. I wonder whether it is possible to wake up a capability immediately. – Yuras Oct 06 '17 at 12:44