
I tried to run some programs with multiple cores and I'm a bit confused by the results. By default, the sorting in the program below takes 20 seconds; when I run it with +RTS -N2 it takes around 16 seconds, but with +RTS -N4 it takes 21 seconds!

Why is that? And is there an example of a program that gets faster with each extra core? (I had similar results with other programs from tutorials.)

Here's the example program:

import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX

qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let a = qsort $ filter (<=x) xs
        b = qsort $ filter (>x)  xs
    in b `par` a ++ x:b
qsort [] = []


randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  let randints = randomList 5000000
  t1 <- getPOSIXTime
  print . sum $ qsort randints
  t2 <- getPOSIXTime
  putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "\n"
crends
    Big question. To start off, laziness makes it a bit tricky to know what's really happening in parallel. You're not actually doing as much in parallel as you think. – dfeuer Jun 17 '20 at 22:38
  • You're including list generation in the time measurement. Have you tried forcing the list before starting the timer? – Fyodor Soikin Jun 17 '20 at 22:41
  • You can also try adding `-A1000M` just to see if garbage collection makes a big difference. I can also recommend the program threadscope, which can be used to get a visualization. – JohEker Jun 17 '20 at 22:54
  • @FyodorSoikin I tried deepseq on the list; it reduced the time, but relative to each other the results are the same (see the sketch after this comment thread). – crends Jun 17 '20 at 23:10
  • If you are looking for performance, see my answer here: https://stackoverflow.com/questions/19752983/is-it-possible-to-speed-up-a-quicksort-with-par-in-haskell/55885118#55885118 You will also find a lot of info about parallel quicksort and why you should abandon lists when parallelizing sorting. – lehins Jun 18 '20 at 00:28
  • @lehins I'm not only interested in performance, but also in parallelization. In the question you linked, the top answer got a huge performance boost on a list quicksort with 4 cores (almost 2 times faster), but for me it gave no improvement! – crends Jun 18 '20 at 09:30
  • @lehins, even staying in a more functional regime you can do better: see https://stackoverflow.com/questions/55810774/how-can-i-optimize-parallel-sorting-to-improve-temporal-performance/55862382#55862382 for my own off-the-cuff version. – dfeuer Jun 18 '20 at 16:52
  • @dfeuer do you have it as a gist somewhere, I'd like to compare it against the mutable array approach. I can of course reconstruct it from the answer, but I'd rather go the lazy way ;) – lehins Jun 18 '20 at 17:00
  • @lehins, no I don't. Sorry. I don't even remember where I stuck the source code. If you play around with it, please let me know what you learn. – dfeuer Jun 18 '20 at 17:02
  • @dfeuer The parallel solution for sorting a list which goes through a Map that you linked to is about 10 times slower than going through a mutable storable array. I don't understand why people care so much about trying to parallelize such a slow data structure as a list. https://github.com/lehins/haskell-quicksort – lehins Jun 18 '20 at 19:51
  • @lehins, you mean it's *only* ten times slower? For code in an SO answer that hasn't been optimized, I think that's pretty darn good! Did you try my code with array inputs and outputs, just for fun? The lists are definitely the least interesting part of that. – dfeuer Jun 19 '20 at 01:54
  • @dfeuer Can you give me a link to the code with array inputs and outputs? I'll be happy to try it for fun. The answer you linked to is pretty good; it just can never be optimal, no matter the optimizations. I also don't get the people who ask these questions. For example, I gave @crends a link to the fastest quicksort implementation I've seen so far in Haskell, which naturally uses all cores, and I got back: "I'm not only interested in performance, but also parallelization." The answer I linked to was not accepted, despite the solution I provided being 1000 times faster than the accepted answer. – lehins Jun 19 '20 at 10:54
  • @lehins my problem is not that quicksort was slow, but that performance wasn't really affected by more cores. That's why I mentioned it, and didn't want to abandon lists before I figured it out. – crends Jun 19 '20 at 13:25
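
Combining the suggestions in the comments, here is a minimal sketch that forces the input list before the clock starts and forces both recursive results, so the sparked branch does real work instead of stopping at the first cons cell. It assumes the deepseq package (for NFData, deepseq, and force) alongside parallel:

import Control.DeepSeq (NFData, deepseq, force)
import Control.Parallel (par, pseq)
import Data.Time.Clock.POSIX

-- Force both halves before combining, so the spark for b evaluates the
-- whole sublist rather than just its outermost constructor.
qsort :: (NFData a, Ord a) => [a] -> [a]
qsort [] = []
qsort (x:xs) =
  let a = force (qsort (filter (<= x) xs))
      b = force (qsort (filter (> x) xs))
  in b `par` (a `pseq` (a ++ x : b))

randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2 ^ 32

main :: IO ()
main = do
  let randints = randomList 5000000
  -- Fully evaluate the input before starting the timer, so list
  -- generation is not counted as part of the sort time.
  randints `deepseq` return ()
  t1 <- getPOSIXTime
  print . sum $ qsort randints
  t2 <- getPOSIXTime
  putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "\n"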

1 Answer


I can't duplicate your results. (Which is a good thing, since I think I was the one claiming to see a performance improvement with -N2 and -N4 with the code you posted.)

On Linux with GHC 8.8.3, and compiling to a standalone executable with -O2 -threaded, I get the following timings on a 4-core desktop:

$ stack ghc -- --version
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
The Glorious Glasgow Haskell Compilation System, version 8.8.3
$ stack ghc -- -O2 -threaded QuickSort3.hs
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
[1 of 1] Compiling Main             ( QuickSort3.hs, QuickSort3.o )
Linking QuickSort3 ...
$ ./QuickSort3 +RTS -N1
10741167410134688
SORT TIME: 7.671760902s

$ ./QuickSort3 +RTS -N2
10741167410134688
SORT TIME: 5.700858877s

$ ./QuickSort3 +RTS -N3
10741167410134688
SORT TIME: 4.88330669s

$ ./QuickSort3 +RTS -N4
10741167410134688
SORT TIME: 4.93364958s

I get similar results with a 16-core Linux laptop and also similar results with a 4-core Windows virtual machine (also using GHC 8.8.3) running on that laptop.

I can think of a few possible explanations for your results.

First, I don't have a tremendously fast desktop machine, so your timings of 20secs seem suspicious. Is it possible you're doing something like:

$ stack runghc QuickSort3.hs +RTS -N4

If so, this passes the +RTS flags to stack, and then runs the Haskell program in single-threaded mode using the slow byte-code interpreter. In my tests, the sort then takes about 30secs no matter what -Nx flag value I pass.
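
One quick sanity check is to print how many capabilities the runtime actually received; a minimal sketch using getNumCapabilities from Control.Concurrent (the program must be built with -threaded for the -N flag to have any effect):

import Control.Concurrent (getNumCapabilities)

main :: IO ()
main = do
  -- If the +RTS -Nx flag is being swallowed by stack, or the executable
  -- was not linked with -threaded, this prints 1 instead of the expected value.
  n <- getNumCapabilities
  putStrLn $ "Capabilities: " ++ show n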

Second, is it possible you're running this on a virtual machine with a limited number of cores (or an extremely old piece of two-core hardware)? As noted, I tried testing under a Windows virtual machine and got similar results to the Linux version with a 4-core virtual machine but quite erratic results with a 2-core virtual machine (e.g., 11.4, 13.0, and 51.3secs for -N1, -N2, and -N4 respectively, so worse performance for more cores in general, and off-the-charts bad performance for 4 cores).

You could try the following simple parallel sums benchmark, which might scale better:

import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX

randomList :: Int -> Int -> [Int]
randomList seed n = take n $ tail (iterate lcg seed)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  t1 <- getPOSIXTime
  let n = 50000000
      a = sum $ randomList 1 n
      b = sum $ randomList 2 n
      c = sum $ randomList 3 n
      d = sum $ randomList 4 n
      e = sum $ randomList 5 n
      f = sum $ randomList 6 n
      g = sum $ randomList 7 n
      h = sum $ randomList 8 n
  print $ a `par` b `par` c `par` d `par` e `par` f `par` g `par` h `par` (a+b+c+d+e+f+g+h)
  t2 <- getPOSIXTime
  putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "\n"
K. A. Buhr
  • I did the tests without using stack. I have both Windows 10 and Ubuntu installed on my laptop (not a virtual machine). My processor is an Intel Core i5-4200U. Your code worked for me; results from -N1 to -N4: 16s, 10.5s, 9.4s, 8.7s. My performance with Haskell is really weird: I tried solving the same tasks with C++ and Python, and even Python did the sorting about 10 times faster! – crends Jun 18 '20 at 18:43
  • Ah, well, a Core i5-4200U is technically a 2-core processor, even though it supports hyperthreading up to 4 threads, so that may explain your results. (My desktop is an overclocked i5-6600K, which is getting old but has 4 full cores.) I'm not sure why the quicksort performs worse at `-N4` while the parallel sums perform better, but the quicksort example does extensive garbage collection, and I think parallel GC performs particularly badly when the number of threads exceeds the number of real cores. – K. A. Buhr Jun 18 '20 at 19:13
  • @k-a-buhr thank you. Can you also say whether I should always compile with -O2? I tried sorting the same data as a vector from Data.Vector and without -O2 it was actually slower than the list sorting (20 secs without -O2 and about 3.5 secs with it). – crends Jun 18 '20 at 21:58
  • Yes, I think `-O2` is considered the standard optimization level for production code and for benchmarking. The only reason to use `-O0` (the default) is that it's faster during development (and potentially much faster when recompiling large, multi-module programs after small changes, since `-O2` does a lot more cross-module recompilation than `-O0`). I don't know if anyone uses `-O1` (equivalent to `-O` with no number); it might be helpful for development if `-O0` generates code that's too slow to run. – K. A. Buhr Jun 18 '20 at 22:23
  • @k-a-buhr can you also explain limiting the number of sparks from your answer here: https://stackoverflow.com/a/55894549/13765493 ? Doesn't 20 layers mean that there will be at most about a million sparks? (and isn't that already too much?) Wouldn't it be better to limit spark creation by checking the sublist length? (see the sketch below) – crends Jun 26 '20 at 19:07
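
On that last question, a depth cutoff is the cheap option on lists, because the depth test is O(1) while checking a sublist's length is itself a full traversal of that sublist. A minimal sketch of the idea, again assuming force from the deepseq package:

import Control.DeepSeq (NFData, force)
import Control.Parallel (par, pseq)

-- Spark new work only down to a fixed depth; below the cutoff, sort
-- sequentially. Depth d creates at most roughly 2^d sparks, so a small
-- depth (a handful of levels beyond the core count) is usually plenty.
qsortPar :: (NFData a, Ord a) => Int -> [a] -> [a]
qsortPar _ [] = []
qsortPar depth (x:xs)
  | depth <= 0 = qsortSeq (x:xs)
  | otherwise =
      let a = force (qsortPar (depth - 1) (filter (<= x) xs))
          b = force (qsortPar (depth - 1) (filter (> x) xs))
      in b `par` (a `pseq` (a ++ x : b))

-- Plain sequential quicksort used below the cutoff.
qsortSeq :: Ord a => [a] -> [a]
qsortSeq [] = []
qsortSeq (x:xs) =
  qsortSeq (filter (<= x) xs) ++ x : qsortSeq (filter (> x) xs)

With 20 layers the bound is indeed roughly 2^20, about a million sparks, which is far more than a handful of cores can usefully schedule.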