I can't duplicate your results. (Which is a good thing, since I think I was the one claiming to see a performance improvement with -N2
and -N4
with the code you posted.)
On Linux with GHC 8.8.3, and compiling to a standalone executable with -O2 -threaded
, I get the following timings on a 4-core desktop:
$ stack ghc -- --version
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
The Glorious Glasgow Haskell Compilation System, version 8.8.3
$ stack ghc -- -O2 -threaded QuickSort3.hs
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
[1 of 1] Compiling Main ( QuickSort3.hs, QuickSort3.o )
Linking QuickSort3 ...
$ ./QuickSort3 +RTS -N1
10741167410134688
SORT TIME: 7.671760902s
$ ./QuickSort3 +RTS -N2
10741167410134688
SORT TIME: 5.700858877s
$ ./QuickSort3 +RTS -N3
10741167410134688
SORT TIME: 4.88330669s
$ ./QuickSort3 +RTS -N4
10741167410134688
SORT TIME: 4.93364958s
I get similar results with a 16-core Linux laptop and also similar results with a 4-core Windows virtual machine (also using GHC 8.8.3) running on that laptop.
I can think of a few possible explanations for your results.
First, I don't have a tremendously fast desktop machine, so your timings of 20secs seem suspicious. Is it possible you're doing something like:
$ stack runghc QuickSort3.hs +RTS -N4
If so, this passes the +RTS
flags to stack
, and then runs the Haskell program in single-threaded mode using the slow byte-code interpreter. In my tests, the sort then takes about 30secs no matter what -Nx
flag value I pass.
Second, is it possible you're running this on a virtual machine with a limited number of cores (or an extremely old piece of two-core hardware)? As noted, I tried testing under a Windows virtual machine and got similar results to the Linux version with a 4-core virtual machine but quite erratic results with a 2-core virtual machine (e.g., 11.4, 13.0, and 51.3secs for -N1
, -N2
, and -N4
respectively, so worse performance for more cores in general, and off-the-charts bad performance for 4 cores).
You could try the following simple parallel sums benchmark, which might scale better:
import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX
randomList :: Int -> Int -> [Int]
randomList seed n = take n $ tail (iterate lcg seed)
where lcg x = (a * x + c) `rem` m
a = 1664525
c = 1013904223
m = 2^32
main :: IO ()
main = do
t1 <- getPOSIXTime
let n = 50000000
a = sum $ randomList 1 n
b = sum $ randomList 2 n
c = sum $ randomList 3 n
d = sum $ randomList 4 n
e = sum $ randomList 5 n
f = sum $ randomList 6 n
g = sum $ randomList 7 n
h = sum $ randomList 8 n
print $ a `par` b `par` c `par` d `par` e `par` f `par` g `par` h `par` (a+b+c+d+e+f+g+h)
t2 <- getPOSIXTime
putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "\n"