Parallel Repa code doesn't create sparks

Question

I'm writing code to do a subset product: it takes a list of elements and a list of indicator variables (of the same length). The product is computed in a tree, which is crucial to our application. Each product is expensive, so my goal was to compute each level of the tree in parallel, evaluating consecutive levels in sequence. Thus there isn't any nested parallelism going on.

I only have repa code in ONE function, near the top level of my overall code. Note that subsetProd is not monadic.

The steps:

chunk up the lists into pairs (no parallelism)
zip the chunked lists (no parallelism)
map the product function over this list (using Repa map), creating a Delayed array
call computeP to evaluate the map in parallel
convert the Repa result back to a list
make a recursive call (on lists half the size of the inputs)
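
For readers who want the shape of the recursion without the Repa details, the steps above can be sketched on plain lists. This is only an illustration: `level` and `treeReduce` are hypothetical names, the flags are left out, and the input is assumed to be non-empty with a power-of-two length. In the real code, `level` is the step that would be evaluated in parallel.

```haskell
-- Pair up neighbours; a trailing unpaired element is dropped
-- (assumption: the input has even length).
chunk :: [a] -> [(a, a)]
chunk (x1:x2:xs) = (x1, x2) : chunk xs
chunk _          = []

-- One level of the tree: combine each pair with the supplied operation.
level :: (a -> a -> a) -> [a] -> [a]
level op = map (uncurry op) . chunk

-- Reduce the list level by level until a single element remains.
treeReduce :: (a -> a -> a) -> [a] -> a
treeReduce _  []  = error "treeReduce: empty input"
treeReduce _  [x] = x
treeReduce op xs  = treeReduce op (level op xs)
```

For example, `treeReduce (*) [1..8]` multiplies (1*2), (3*4), (5*6), (7*8) on the first level, then pairs those results, and so on.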

The code:

{-# LANGUAGE TypeOperators, FlexibleContexts, BangPatterns #-}

import System.Random
import System.Environment (getArgs)
import Control.Monad.State
import Control.Monad.Identity (runIdentity)

import Data.Array.Repa as Repa
import Data.Array.Repa.Eval as Eval
import Data.Array.Repa.Repr.Vector

force :: (Shape sh) => Array D sh e -> Array V sh e
force = runIdentity . computeP

chunk :: [a] -> [(a,a)]
chunk [] = []
chunk (x1:x2:xs) = (x1,x2):(chunk xs)

slow_fib :: Int -> Integer
slow_fib 0 = 0
slow_fib 1 = 1
slow_fib n = slow_fib (n-2) + slow_fib (n-1) 

testSubsetProd :: Int -> Int -> IO ()
testSubsetProd size seed = do
    let work = do
            !flags <- replicateM size (state random)
            !values <- replicateM size (state $ randomR (1,10))
            return $ subsetProd values flags
        value = evalState work (mkStdGen seed)
    print value

subsetProd :: [Int] -> [Bool] -> Int
subsetProd [!x] _ = x
subsetProd !vals !flags = 
    let len = (length vals) `div` 2
        !valpairs = Eval.fromList (Z :. len) $ chunk vals :: (Array V (Z :. Int) (Int, Int))
        !flagpairs = Eval.fromList (Z :. len) $ chunk flags :: (Array V (Z :. Int) (Bool, Bool))
        !prods = force $ Repa.zipWith mul valpairs flagpairs
        mul (!v0,!v1) (!f0,!f1)
            | (not f0) && (not f1) = 1
            | (not f0) = v0+1
            | (not f1) = v1+1
            | otherwise = fromInteger $ slow_fib ((v0*v1) `mod` 35)
    in subsetProd (toList prods) (Prelude.map (uncurry (||)) (toList flagpairs))

main :: IO ()
main = do
  args <- getArgs
  let [numleaves, seed] = Prelude.map read args :: [Int]
  testSubsetProd numleaves seed

The entire program is compiled with

ghc -Odph -rtsopts -threaded -fno-liberate-case -funfolding-use-threshold1000 -funfolding-keeness-factor1000 -fllvm -optlo-O3

per these instructions, on GHC 7.6.2 x64.

I ran my program (Subset) using

$> time ./Test 4096 4 +RTS -sstderr -N4

8 seconds later:

672,725,819,784 bytes allocated in the heap
 11,312,267,200 bytes copied during GC
   866,787,872 bytes maximum residency (49 sample(s))
   433,225,376 bytes maximum slop
        2360 MB total memory in use (0 MB lost due to fragmentation)

                                Tot time (elapsed)  Avg pause  Max pause
  Gen  0     1284212 colls, 1284212 par   174.17s   53.20s     0.0000s    0.0116s
  Gen  1        49 colls,    48 par   13.76s    4.63s     0.0946s    0.6412s

  Parallel GC work balance: 16.88% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  497.80s  (448.38s elapsed)
  GC      time  187.93s  ( 57.84s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  685.73s  (506.21s elapsed)

  Alloc rate    1,351,400,138 bytes per MUT second

  Productivity  72.6% of total user, 98.3% of total elapsed

gc_alloc_block_sync: 8670031
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 571398

My code does get slower as I increase the -N parameter (7.628 seconds for -N1, 7.891 seconds for -N2, 8.659 seconds for -N4), but I'm getting 0 sparks created, which seems like a prime suspect as to why I'm not getting any parallelism. Also, compiling with a whole slew of optimizations helps with the runtime, but not the parallelism.

Threadscope confirms that no serious work is being done on three HECs, but the garbage collector seems to be using all 4 HECs.

[Image: ThreadScope trace for the -sstderr block above]

So why isn't Repa making any sparks? My product tree has 64 leaves, so even if Repa made a spark for every internal node, there should be ~63 sparks. I feel like it could have something to do with my use of the ST monad encapsulating the parallelism, though I'm not quite sure why this would cause an issue. Perhaps sparks can only be created in an IO monad?

If this is the case, does anyone have an idea of how I could perform this tree product where each level is done in parallel (without resulting in nested parallelism, which seems unnecessary for my task)? More generally, perhaps there is a better way to parallelize the tree product or make better use of Repa.

Bonus points for explaining why the runtime increases as I increase the -N parameter, even when no sparks are created.

EDIT I changed the code example above to be a compiling example of my problem. The program flow almost perfectly matches my real code: I randomly choose some inputs, and then do a subset product on them. I am now using the identity monad. I have tried lots of small changes to my code: inlining or not, bang patterns or not, variations on using two Repa lists and a Repa zipWith vs zipping the lists sequentially and using a Repa map, etc, none of which helped at all.

Even if I'm running into this problem in my example code, my real program is much larger.

To make a more detailed analysis of the performance problem, it would really help to have a somewhat cut down but compilable version of your code. Some minor remarks: I'm not sure if it's worth building up a manifest array `pairs` first. That could be delayed as well. Why use lists of length `2` rather than pairs? The resulting array could be unboxed, as it contains `Int`. You should try producing an eventlog and running threadscope to see if parallelism occurs in some phases of your program. — kosmikus, Apr 19 '13 at 06:43
Oh, and I only now see that `subsetProd` is recursive. Are you sure that you want to convert the array to a list only to recompute an array from it in every step? — kosmikus, Apr 19 '13 at 06:47
@JohnL This is just a snippet of a much larger program. There are several "external" calls in the code above, but they are all to pure, sequential, (expensive) functions. — crockeea, Apr 19 '13 at 14:15
@kosmikus Manifest arrays shouldn't be delayed, right? I thought that was the point. I'm using lists because there is a nice function (in Data.List.Split) to do the chunking for me. Threadscope shows absolutely zero parallelism among the program code until the last millisecond of execution, as I said in the question. The GC runs on all four threads. I can make a small example that does an integer subset product or something, but it will be too fast to get any parallelization out of, I'm afraid. — crockeea, Apr 19 '13 at 14:19
@kosmikus I did mean to convert to Repa arrays at every level: I don't have a way of chunking up a Repa vector, nor of applying maps on pairs of elements. The arrays are very short (the longest has 64 elements, then they get smaller by half each time). — crockeea, Apr 19 '13 at 14:24

kosmikus · Accepted Answer · 2013-04-23T07:38:40.420

Why is there no parallelism?

The main reason there is no parallelism (at least in your now simplified and working program) is that you're using computeP to produce an array with the V representation, and boxed vectors aren't strict in their elements. computeP therefore only fills the array with unevaluated thunks, so you aren't actually doing any real work in parallel. The easiest fix is to use an unboxed U array as the result, by changing force to this definition:

force :: (Shape sh, Unbox e) => Array D sh e -> Array U sh e
force a = runIdentity (computeP a)
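
To see why a boxed result hides the work, here is a minimal sketch that uses an ordinary list as a stand-in for a boxed vector (an analogy only, not Repa's actual representation). The names `expensive`, `boxedResults` and `spineLength` are made up for the illustration. Building the container and walking its structure never runs the element computations, which is why the last element can even be one that would crash if it were evaluated.

```haskell
-- Stand-in for an expensive element computation (hypothetical).
expensive :: Int -> Integer
expensive n = product [1 .. fromIntegral n]

-- Building a boxed container only allocates one thunk per element;
-- no product is computed here. The third element would crash if forced.
boxedResults :: [Integer]
boxedResults = map expensive [1000, 2000, undefined]

-- Walking the structure succeeds, because the elements are never evaluated.
spineLength :: Int
spineLength = length boxedResults
```

With an unboxed representation there is no room for a thunk in each slot, so filling the array has to evaluate every element, and that evaluation is what the worker threads share.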

I do recall that in your original code you claimed you're working with a complicated datatype that isn't unboxed. But is it really impossible to make it so? Perhaps you can extract the data you actually need into some unboxable representation? Or make the type an instance of the Unbox class? If not, then you can also use the following variant of force that works for a V-array:

import Control.DeepSeq (NFData(..))

...

force :: (Shape sh, NFData e) => Array D sh e -> Array V sh e
force a = runIdentity $ do
  r  <- computeP a
  !b <- computeUnboxedP (Repa.map rnf r)
  return r

The idea here is that we first compute the V-array structure, and then we compute a U-array of () type from it by mapping rnf over the array. The resulting array is uninteresting, but each of the V-array's elements will be forced in the process¹.

Either of these changes brings runtime for a problem size of 4096 from ~9 down to ~3 seconds with -N4 on my machine.

In addition, I think it's strange that you convert between lists and arrays in every step. Why not make subsetProd take two arrays? Also, at least for the values, using an intermediate V array for the pairs seems unnecessary; you could just as well use a D array. But in my experiments these changes didn't have a significant beneficial effect on runtime.
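
A rough sketch of what an array-only version might look like, assuming the input is non-empty with a power-of-two length. The names `pairUp` and `subsetProdA` are hypothetical, `mul` is copied from the question, and the pairing uses fromFunction to build a delayed array of neighbours. I have not benchmarked this variant, so treat it as a starting point rather than a tuned implementation.

```haskell
import Data.Array.Repa as R
import Data.Functor.Identity (runIdentity)

slowFib :: Int -> Integer
slowFib 0 = 0
slowFib 1 = 1
slowFib n = slowFib (n - 2) + slowFib (n - 1)

-- Delayed array of neighbouring pairs (assumes an even number of elements).
pairUp :: Unbox a => Array U DIM1 a -> Array D DIM1 (a, a)
pairUp arr =
  let Z :. n = R.extent arr
  in R.fromFunction (Z :. n `div` 2) $ \(Z :. i) ->
       (arr ! (Z :. 2 * i), arr ! (Z :. 2 * i + 1))

-- Same combining function as in the question.
mul :: (Int, Int) -> (Bool, Bool) -> Int
mul (v0, v1) (f0, f1)
  | not f0 && not f1 = 1
  | not f0           = v0 + 1
  | not f1           = v1 + 1
  | otherwise        = fromInteger (slowFib ((v0 * v1) `mod` 35))

-- Each level is computed in parallel into an unboxed array; levels run in sequence.
subsetProdA :: Array U DIM1 Int -> Array U DIM1 Bool -> Int
subsetProdA vals flags
  | R.size (R.extent vals) == 1 = vals ! (Z :. 0)
  | otherwise =
      let vp     = pairUp vals
          fp     = pairUp flags
          prods  = runIdentity (R.computeUnboxedP (R.zipWith mul vp fp))
          flags' = R.computeUnboxedS (R.map (uncurry (||)) fp)
      in prods `seq` flags' `seq` subsetProdA prods flags'
```

The inputs can be built with `R.fromListUnboxed (Z :. n) xs`. Forcing `prods` and `flags'` before recursing keeps each level fully evaluated before the next parallel computation starts.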

Why are there no sparks?

Repa never creates sparks. Haskell has many different approaches to parallelism, and sparks are one particular mechanism that has special support in the run-time system. However, only some libraries, for example the parallel package and one particular scheduler of the monad-par package, actually make use of the mechanism. Repa does not. It uses forkIO, i.e., threads, internally, but provides a pure interface to the outside. So the absence of sparks is in itself nothing to worry about.
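
To illustrate the difference, here is a minimal sketch of thread-based parallelism using only forkIO and MVars from base. This is not Repa's actual scheduler, and `parMapThreads` is a hypothetical name. A program built this way can keep several cores busy while the SPARKS line in the -s output still reports 0, because no spark is ever created.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

-- Evaluate each work item in its own thread and collect the results in order.
-- The result type is Int so that evaluate (which forces to weak head normal
-- form) does all of the work inside the worker thread.
-- Caveat: if f throws, the corresponding takeMVar never gets a value.
parMapThreads :: (a -> Int) -> [a] -> IO [Int]
parMapThreads f xs = do
  vars <- mapM spawn xs
  mapM takeMVar vars
  where
    spawn x = do
      var <- newEmptyMVar
      _ <- forkIO (evaluate (f x) >>= putMVar var)
      return var
```

Compiled with -threaded and run with +RTS -N, the forked threads are distributed over the available capabilities by the run-time system.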

¹ I originally had no idea how to do that, so I asked Ben Lippmeier, the author of Repa. Thanks a lot to Ben for pointing out to me the option of mapping rnf to produce a different array, and the fact that there's an Unbox instance for ().

I'm quite shocked that using unboxed arrays makes the parallelism work. I will certainly put a lot more effort into making my type Unbox, but it might be difficult. As for why I'm not using arrays to begin with, see my comment on the original question about not being able to work with arrays like I can work with lists. Thanks for your help! — crockeea, Apr 20 '13 at 18:43
@Eric Chunking up an array is simple enough. For example, you can create a delayed array like this: `fromFunction (Z :. len) $ \ (Z :. i) -> (vals ! (Z :. 2 * i), vals ! (Z :. 2 * i + 1))`. — kosmikus, Apr 20 '13 at 20:55
@Eric I've edited the answer once more, because Ben Lippmeier explained to me how to make it work for `V`-arrays if you have to. — kosmikus, Apr 23 '13 at 07:39
Thanks for the update. I did get some parallelism with your suggestion, but parallel strategies (creating a spark for every internal node) handily beat Repa. — crockeea, May 03 '13 at 01:55
