I'm writing code to do a subset product: it takes a list of elements and a list of indicator variables (of the same length). The product is computed in a tree, which is crucial to our application. Each product is expensive, so my goal was to compute each level of the tree in parallel, evaluating consecutive levels in sequence. Thus there isn't any nested parallelism going on.
I only have repa code in ONE function, near the top level of my overall code. Note that subsetProd is not monadic.
The steps:
- chunk up the lists into pairs (no parallelism)
- zip the chunked lists (no parallelism)
- map the product function over this list (using Repa map), creating a Delayed array
- call computeP to evaluate the map in parallel
- convert the Repa result back to a list
- make a recursive call (on lists half the size of the inputs)
The code:
{-# LANGUAGE TypeOperators, FlexibleContexts, BangPatterns #-}
import System.Random
import System.Environment (getArgs)
import Control.Monad.State
import Control.Monad.Identity (runIdentity)
import Data.Array.Repa as Repa
import Data.Array.Repa.Eval as Eval
import Data.Array.Repa.Repr.Vector
force :: (Shape sh) => Array D sh e -> Array V sh e
force = runIdentity . computeP
chunk :: [a] -> [(a,a)]
chunk [] = []
chunk (x1:x2:xs) = (x1,x2):(chunk xs)
slow_fib :: Int -> Integer
slow_fib 0 = 0
slow_fib 1 = 1
slow_fib n = slow_fib (n-2) + slow_fib (n-1)
testSubsetProd :: Int -> Int -> IO ()
testSubsetProd size seed = do
let work = do
!flags <- replicateM size (state random)
!values <- replicateM size (state $ randomR (1,10))
return $ subsetProd values flags
value = evalState work (mkStdGen seed)
print value
subsetProd :: [Int] -> [Bool] -> Int
subsetProd [!x] _ = x
subsetProd !vals !flags =
let len = (length vals) `div` 2
!valpairs = Eval.fromList (Z :. len) $ chunk vals :: (Array V (Z :. Int) (Int, Int))
!flagpairs = Eval.fromList (Z :. len) $ chunk flags :: (Array V (Z :. Int) (Bool, Bool))
!prods = force $ Repa.zipWith mul valpairs flagpairs
mul (!v0,!v1) (!f0,!f1)
| (not f0) && (not f1) = 1
| (not f0) = v0+1
| (not f1) = v1+1
| otherwise = fromInteger $ slow_fib ((v0*v1) `mod` 35)
in subsetProd (toList prods) (Prelude.map (uncurry (||)) (toList flagpairs))
main :: IO ()
main = do
args <- getArgs
let [numleaves, seed] = Prelude.map read args :: [Int]
testSubsetProd numleaves seed
The entire program is compiled with
ghc -Odph -rtsopts -threaded -fno-liberate-case -funfolding-use-threshold1000 -funfolding-keeness-factor1000 -fllvm -optlo-O3
per these instructions, on GHC 7.6.2 x64.
I ran my program (Subset) using
$> time ./Test 4096 4 +RTS -sstderr -N4
8 seconds later later:
672,725,819,784 bytes allocated in the heap
11,312,267,200 bytes copied during GC
866,787,872 bytes maximum residency (49 sample(s))
433,225,376 bytes maximum slop
2360 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1284212 colls, 1284212 par 174.17s 53.20s 0.0000s 0.0116s
Gen 1 49 colls, 48 par 13.76s 4.63s 0.0946s 0.6412s
Parallel GC work balance: 16.88% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 497.80s (448.38s elapsed)
GC time 187.93s ( 57.84s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 685.73s (506.21s elapsed)
Alloc rate 1,351,400,138 bytes per MUT second
Productivity 72.6% of total user, 98.3% of total elapsed
gc_alloc_block_sync: 8670031
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 571398
My code does get slower as I increase the -N parameter, (7.628 seconds for -N1, 7.891 seconds for -N2, 8.659 seconds for -N4) but I'm getting 0 sparks created, which seems like a prime suspect as to why I'm not getting any parallelism. Also, compiling with a whole slew of optimizations helps with the runtime, but not the parallelism.
Threadscope confirms that no serious work is being done on three HECs, but the garbage collector seems to be using all 4 HECs.
So why isn't Repa making any sparks? My product tree has 64 leaves, so even if Repa made a spark for every internal node, there should be ~63 sparks. I feel like it could have something to do with my use of the ST monad encapsulating the parallelism, though I'm not quite sure why this would cause an issue. Perhaps sparks can only be created in an IO monad?
If this is the case, does anyone have an idea of how I could perform this tree product where each level is done in parallel (without resulting in nested parallelism, which seems unnecessary for my task). In general, perhaps there is a better way to parallelize the tree product or make better use of Repa.
Bonus points for explaining why the runtime increases as I increase the -N parameter, even when no sparks are created.
EDIT I changed the code example above to be a compiling example of my problem. The program flow almost perfectly matches my real code: I randomly choose some inputs, and then do a subset product on them. I am now using the identity monad. I have tried lots of small changes to my code: inlining or not, bang patterns or not, variations on using two Repa lists and a Repa zipWith vs zipping the lists sequentially and using a Repa map, etc, none of which helped at all.
Even if I'm running into this problem in my example code, my real program is much larger.