Fast sorting in Haskell

Question

After reading Stack Overflow question Using vectors for performance improvement in Haskell describing a fast in-place quicksort in Haskell, I set myself two goals:

Implementing the same algorithm with a median of three to avoid bad performances on pre-sorted vectors;
Making a parallel version.

Here is the result (some minor pieces have been left for simplicity):

import qualified Data.Vector.Unboxed.Mutable as MV
import qualified Data.Vector.Generic.Mutable as GM

type Vector = MV.IOVector Int
type Sort = Vector -> IO ()

medianofthreepartition :: Vector -> Int -> IO Int
medianofthreepartition uv li = do
    p1 <- MV.unsafeRead uv li
    p2 <- MV.unsafeRead uv $ li `div` 2
    p3 <- MV.unsafeRead uv 0
    let p = median p1 p2 p3
    GM.unstablePartition (< p) uv

vquicksort :: Sort
vquicksort uv = do
    let li = MV.length uv - 1
    j <- medianofthreepartition uv li
    when (j > 1) (vquicksort (MV.unsafeSlice 0 j uv))
    when (j + 1 < li) (vquicksort (MV.unsafeSlice (j+1) (li-j) uv))

vparquicksort :: Sort
vparquicksort uv = do
    let li = MV.length uv - 1
    j <- medianofthreepartition uv li
    t1 <- tryfork (j > 1) (vparquicksort (MV.unsafeSlice 0 j uv))
    t2 <- tryfork (j + 1 < li) (vparquicksort (MV.unsafeSlice (j+1) (li-j) uv))
    wait t1
    wait t2

tryfork :: Bool -> IO () -> IO (Maybe (MVar ()))
tryfork False _ = return Nothing
tryfork True action = do
  done <- newEmptyMVar :: IO (MVar ())
  _ <- forkFinally action (\_ -> putMVar done ())
  return $ Just done

wait :: Maybe (MVar ()) -> IO ()
wait Nothing = return ()
wait (Just done) = swapMVar done ()

median :: Int -> Int -> Int -> Int
median a b c
        | a > b =
                if b > c then b
                        else if a > c then c
                                else a
        | otherwise =
                if a > c then a
                        else if b > c then c
                                else b

For vectors with 1,000,000 elements, I get the following results:

"Number of threads: 4"

"**** Parallel ****"
"Testing sort with length: 1000000"
"Creating vector"
"Printing vector"
"Sorting random vector"
CPU time:  12.30 s
"Sorting ordered vector"
CPU time:   9.44 s

"**** Single thread ****"
"Testing sort with length: 1000000"
"Creating vector"
"Printing vector"
"Sorting random vector"
CPU time:   0.27 s
"Sorting ordered vector"
CPU time:   0.39 s

My questions are:

Why are performances still decreasing with a pre-sorted vector?
Why does using forkIO and four thread fails to improve performances?

I'm about to go to bed, so no analysis right now, just what jumps out. When you're forking on every recursive call, you're creating a great lot of threads, the thread scheduling overhead overwhelms the actual work to be done. If there is even synchronisation between the different threads accessing the array involved, that would totally kill performance even for fewer threads. If you want a speedup, fork only for the first few recursive calls to not have more threads running than you have cores. — Daniel Fischer, Jul 28 '13 at 21:30
For fast parallelism you want to use `par`, not `forkIO`. See the `parallel` package [here](http://hackage.haskell.org/package/parallel-3.2.0.3) for more details. — Gabriella Gonzalez, Jul 28 '13 at 23:03
@GabrielGonzalez does `par` works well with computations which are "only" IO operations? Also, is it necessary to understand the Control.Parallel.Strategies module? — Simon Bergot, Jul 29 '13 at 08:29
@Simon The effectful counterpart to `par` is the `Par` monad, which is part of the `monad-par` package, which you can find [here](http://hackage.haskell.org/package/monad-par). Checkout the `Control.Monad.Par.IO` module. — Gabriella Gonzalez, Jul 29 '13 at 14:47
@jberryman no I haven't. I think I will take the opportunity to learn haskell profiling with this. I am still quite far from really understanding the execution model yet. — Simon Bergot, Jul 29 '13 at 15:37
Also, if you're using parallelism you have to take in consideration the number of cores your machine has. You don't want to overload them. You can get the number of cores using getNumCapabilities function from Control.Concurrent. — Jcao02, Jul 30 '13 at 22:28
@Jcao02 Haskell is supposed to use green threads, so that's only an issue if you're using an incomplete Haskell compiler. — Jeremy List, Jul 08 '15 at 03:51

score 1 · Answer 1 · answered Aug 22 '13 at 15:13

A better idea is to use Control.Parallel.Strategies to parallelize quicksort. With this approach you will not create expensive threads for every code that can be executed in parallel. You can also create a pure computation instead an IO.

Then you have to compile according to the number of cores you have: http://www.haskell.org/ghc/docs/latest/html/users_guide/using-concurrent.html

For an example, look at this simple quicksort on lists, written by Jim Apple:

import Data.HashTable as H
import Data.Array.IO
import Control.Parallel.Strategies
import Control.Monad
import System

exch a i r =
    do tmpi <- readArray a i
       tmpr <- readArray a r
       writeArray a i tmpr
       writeArray a i tmpi

bool a b c = if c then a else b

quicksort arr l r =
  if r <= l then return () else do
    i <- loop (l-1) r =<< readArray arr r
    exch arr i r
    withStrategy rpar $ quicksort arr l (i-1)
    quicksort arr (i+1) r
  where
    loop i j v = do
      (i', j') <- liftM2 (,) (find (>=v) (+1) (i+1)) (find (<=v) (subtract 1) (j-1))
      if (i' < j') then exch arr i' j' >> loop i' j' v
                   else return i'
    find p f i = if i == l then return i
                 else bool (return i) (find p f (f i)) . p =<< readArray arr i

main = 
    do [testSize] <- fmap (fmap read) getArgs
       arr <- testPar testSize
       ans <- readArray arr  (testSize `div` 2)
       print ans

testPar testSize =
    do x <- testArray testSize
       quicksort x 0 (testSize - 1)
       return x

testArray :: Int -> IO (IOArray Int Double)
testArray testSize = 
    do ans <- newListArray (0,testSize-1) [fromIntegral $ H.hashString $ show i | i <- [1..testSize]]
       return ans

Because the OS only sees one thread per CPU core, but those threads are internally running a more lightweight threading system (which doesn't have to consider paging, multi-user, etc) — Jeremy List, Jul 08 '17 at 23:58

Fast sorting in Haskell

1 Answers1

Linked