Benchmarking Filter and Partition

Question

I was testing the performance of the partition function for lists and got some strange results, I think.

We have that partition p xs == (filter p xs, filter (not . p) xs) but we chose the first implementation because it only performs a single traversal over the list. Yet, the results I got say that it maybe be better to use the implementation that uses two traversals.

Here is the minimal code that shows what I'm seeing

import Criterion.Main
import System.Random
import Data.List (partition)

mypartition :: (a -> Bool) -> [a] -> ([a],[a])
mypartition p l = (filter p l, filter (not . p) l)



randList :: RandomGen g => g -> Integer -> [Integer]
randList gen 0 = []
randList gen n = x:xs
  where
    (x, gen') = random gen
    xs = randList gen' (n - 1)

main = do
  gen <- getStdGen
  let arg10000000 = randList gen 10000000
  defaultMain [
      bgroup "filters -- split list in half " [
        bench "partition100"         $ nf (partition (>= 50)) arg10000000
      , bench "mypartition100"       $ nf (mypartition (>= 50)) arg10000000
      ]
      ]

I ran the tests both with -O and without it and both times I get that the double traversals is better.

I am using ghc-7.10.3 with criterion-1.1.1.0

My questions are:

Is this expected?
Am I using Criterion correctly? I know that laziness can be tricky and (filter p xs, filter (not . p) xs) will only do two traversals if both elements of the tuple are used.
Does this has to do something with the way lists are handled in Haskell?

Thanks a lot!

You're using `criterion` pretty much correctly, but you should ideally use `env` to create the randomized list and pass it to the benchmarks, or at least be sure to force that list fully before benchmarking. — dfeuer, Jul 30 '16 at 17:35
Also, since this is a question about performance, you should tell us what version of GHC you're using. — dfeuer, Jul 30 '16 at 17:36
Thanks. I edited my question. I am using `ghc-7.10.3`. Do you know if there's any reason why the two traversals are faster in this case? I tested this also using a local implementation of lists since I thought it might something to do with ghc's implementation, but the results don't change! — aesadde, Jul 30 '16 at 18:04
In the stdlib `partition`, the computation of the first element causes the entire list to be retained in memory (and likewise for 2nd elem.) - in `mypartition`, the two can be computed independently, at the cost of computing the predicate at most twice as often. If you profile the memory usage of the two, `mypartition` allocates more memory than `partition`, but has a much lower maximum residency. In the case of one `Integer` comparison, the overhead of duplicated work in `mypartition` is *much* less than the overhead of retaining the list (and a 10m length list is not insignificant!) — user2407038, Jul 30 '16 at 20:37
Thanks! This makes a lot of sense. I had already seen the difference in memory usage but hadn't notice the maximum residency part. Indeed, I ran a few more benchmarks using more complex predicates and I start seeing now the expected behaviour of one traversal being better than two. — aesadde, Jul 30 '16 at 20:47
why not use `-O2`, which is what any performance critical code should be compiled with. — Thomas M. DuBuisson, Jul 30 '16 at 23:27
Yes, this is what I expect, and what I've observed in the past. Writing a good partition in Haskell is not possible, as far as I know. I could write one in assembly. :) — augustss, Jul 31 '16 at 01:12
@augustss, that's quite a teaser. Can you explain why? Does Wadler's GC trick fail to collect the pairs, or is something else going wrong? Is it something that can be fixed? — dfeuer, Jul 31 '16 at 02:16
By Wadler's GC trick you mean my GC trick. I'm the one who described it to Phil, but he's the one who thought it was worth a paper. I also think David A Turner invented it before me. But anyway, there should only ever be one pair allocated (the result) and one traversal. As either of the results are forced it will walk down the list and tack the unwanted elements to the end of the other result list. This can't be expressed in Haskell. — augustss, Jul 31 '16 at 02:32
The hbc/lml version of accumArray used the same trick to get good complexity. The trick was invented by Thomas Johnsson ca 1983. The partition function can be easily expressed in terms of accumArray. — augustss, Jul 31 '16 at 02:34
@augustss, yes, that one. I had no idea of the history there. Good to know. Do you have any thoughts about what would need to be added to the language (or even what absurdly dangerous primop could be added) to make these kinds of things work well? I envision some thread safety complications, as well as general "what do we really want?" issues. — dfeuer, Jul 31 '16 at 02:50
You need a thunk update primitive and some updatable references in closures. It could probably be done in some suitable monad. But it requires care, since you want to avoid write barriers where possible. I've not even thought of how to do it multi-threaded, but it might be possible with just a spin-lock in the main closure that does all the work. Maybe I should write a blog post. It's sad when old tricks like this one are forgotten. — augustss, Jul 31 '16 at 03:14
@augustss, if you could magic up something like that for GHC (and get it fast), we'll all be grateful. The next release of `Data.Sequence` will have a rather horribly complicated `fromList` by Lennart Spitzner designed to work around similar lazy pair inefficiency (although that one hurts even more because the GC trick doesn't even work for whatever reason). — dfeuer, Jul 31 '16 at 03:46
Thanks for the explanations! Wouldn't using a strict pair help? I ran the same experiments but with a map from `Data.Map` instead of lists and I always see the single traversal over the map being better than the two traversals, unlike what we see in the lists case. — aesadde, Jul 31 '16 at 09:08
@aesadde, aside from the semantic issue, a strict pair will lead to poor performance when the lists are sufficiently long. `Data.Map` is entirely strict in its structure, so it can't take advantage of laziness even for operations like `fmap` which would be happier with a lazy structure. — dfeuer, Jul 31 '16 at 16:07
@dfeuer thanks for the explanation! There's one thing that puzzles me still though. I've implemented a partition function using CPS but I don't see any improvements, it actually ends up being very inefficient but I thought that by avoiding tupling I'd get better results. Is this something expected as well? Also is there a paper or somewhere I can find more about the "GC Trick" ? — aesadde, Aug 01 '16 at 18:33
aesadde, [here's the paper](http://homepages.inf.ed.ac.uk/wadler/topics/garbage-collection.html) about the GC trick. CPS isn't particularly efficient in general; there are cases where it's great (or necessary), but other times you just end up allocating closures instead of data. — dfeuer, Aug 01 '16 at 19:47

score 5 · Accepted Answer · answered Aug 04 '16 at 17:09

There is no black or white answer to the question. To dissect the problem consider the following code:

import Control.DeepSeq
import Data.List (partition)
import System.Environment (getArgs)


mypartition :: (a -> Bool) -> [a] -> ([a],[a])
mypartition p l = (filter p l, filter (not . p) l)


main :: IO ()
main = do
  let cnt = 10000000
      xs = take cnt $ concat $ repeat [1 .. 100 :: Int]
  args <- getArgs
  putStrLn $ unwords $ "Args:" : args
  case args of
    [percent, fun]
      -> let p = (read percent >=)
         in case fun of
           "partition"      ->              print $ rnf $ partition   p xs
           "mypartition"    ->              print $ rnf $ mypartition p xs
           "partition-ds"   -> deepseq xs $ print $ rnf $ partition   p xs
           "mypartition-ds" -> deepseq xs $ print $ rnf $ mypartition p xs
           _ -> err
    _ -> err
  where
    err = putStrLn "Sorry, I do not understand."

I do not use Criterion to have a better control about the order of evaluation. To get timings, I use the +RTS -s runtime option. The different test case are executed using different command line options. The first command line option defines for which percentage of the data the predicate holds. The second command line option chooses between different tests.

The tests distinguish two cases:

The data is generated lazily (2nd argument partition or mypartition).
The data is already fully evaluated in memory (2nd argument partition-ds or mypartition-ds).

The result of the partitioning is always evaluated from left to right, i.e. starting with the list that contains all the elements for which the predicate holds.

In case 1 partition has the advantage that elements of the first resulting list get discarded before all elements of the input list were even produced. Case 1 is especially good, if the predicate matches many elements, i.e. the first command line argument is large.

In case 2, partition cannot play out this advantage, since all elements are already in memory.

For mypartition, in any case all elements are held in memory after the first resulting list is evaluated, because they are needed again to compute the second resulting list. Therefore there is not much of a difference between the two cases.

It seems, the more memory is used, the harder garbage collection gets. Therefore partition is well suited, if the predicate matches many elements and the lazy variant is used.

Conversely, if the predicate does not match many elements or all elements are already in memory, mypartition performs better, since its recursion does not deal with pairs in contrast to partition.

The Stackoverflow question “Irrefutable pattern does not leak memory in recursion, but why?” might give some more insights about the handling of pairs in the recursion of partition.

Benchmarking Filter and Partition

1 Answers1