
In the "programming tips" section of the haskell wiki, I found this example:

count :: (a -> Bool) -> [a] -> Int
count p = length . filter p

This was said to be a better alternative to

count :: (a -> Bool) -> [a] -> Int
count _ [] = 0
count p (x:xs)
   | p x       = 1 + count p xs
   | otherwise =     count p xs

In terms of readability, I entirely agree.

However, isn't that a double traversal, and therefore actually worse than the explicit-recursion function? Does laziness in GHC mean that this is equivalent to a single traversal after optimisation? Which implementation is faster, and why?

AJF
  • The two traversals will likely be fused into one when you turn on optimisations in GHC. But yes, in principle you're right, though the second traversal is only over the shorter, filtered list, so it would often not matter much anyway. – leftaroundabout Apr 26 '15 at 11:09
  • The second version is not tail recursive and will probably lead to more memory being used than the first. You can use a (strict) accumulator instead to keep it in constant memory, or you could use `foldl' (\c a -> if p a then succ c else c) 0` (see the sketch below). – chi Apr 26 '15 at 11:26
  • You might like the [guide to lazy evaluation](https://hackhands.com/guide-lazy-evaluation-haskell/) :D – Random Dev Apr 26 '15 at 12:57
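
A runnable sketch of the strict accumulator chi suggests (the name `countAcc` is not from the thread):

import Data.List (foldl')

-- foldl' forces the accumulator at every step, so the count
-- runs in constant space with a single traversal of the list.
countAcc :: (a -> Bool) -> [a] -> Int
countAcc p = foldl' (\c a -> if p a then succ c else c) 0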

3 Answers


So to see what the optimizer actually does, let's look at the core:

% ghc -O2 -ddump-simpl Temp.hs
[1 of 1] Compiling Temp             ( Temp.hs, Temp.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 29, types: 26, coercions: 0}

Temp.count
  :: forall a_arN.
     (a_arN -> GHC.Types.Bool) -> [a_arN] -> GHC.Types.Int
[GblId,
 Arity=2,
 Caf=NoCafRefs,
 Str=DmdType <L,C(U)><S,1*U>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=2, Value=True,
         ConLike=True, WorkFree=True, Expandable=True,
         Guidance=IF_ARGS [60 0] 191 0}]
Temp.count =
  \ (@ a_aMA)
    (p_arV :: a_aMA -> GHC.Types.Bool)
    (eta_B1 :: [a_aMA]) ->
    letrec {
      go_aNr [Occ=LoopBreaker]
        :: [a_aMA] -> GHC.Prim.Int# -> GHC.Types.Int
      [LclId, Arity=1, Str=DmdType <S,1*U>]
      go_aNr =
        \ (ds_aNs :: [a_aMA]) ->
          case ds_aNs of _ [Occ=Dead] {
            [] -> GHC.Types.I#;
            : y_aNx ys_aNy ->
              case p_arV y_aNx of _ [Occ=Dead] {
                GHC.Types.False -> go_aNr ys_aNy;
                GHC.Types.True ->
                  let {
                    g_a10B [Dmd=<L,C(U)>] :: GHC.Prim.Int# -> GHC.Types.Int
                    [LclId, Str=DmdType]
                    g_a10B = go_aNr ys_aNy } in
                  \ (x_a10C :: GHC.Prim.Int#) -> g_a10B (GHC.Prim.+# x_a10C 1)
              }
          }; } in
    go_aNr eta_B1 0

Cleaning it up a bit:

Temp.count :: forall aType.  (aType -> Bool) -> [aType] -> Int
Temp.count = \(@ aType) (p :: aType -> Bool) (as :: [aType]) ->
  letrec {
    go :: [aType] -> GHC.Prim.Int# -> Int
    go = \(xs :: [aType]) ->
      case xs of _ {
        [] -> I#; -- constructor to make a GHC.Prim.Int# into an Int
        : y ys ->
          case p y of _ {
            False -> go ys;
            True ->
              let {
                go' :: GHC.Prim.Int# -> Int
                go' = go ys 
              } in \(x :: GHC.Prim.Int#) -> go' (GHC.Prim.+# x 1)
          }
      }; 
  } in go as 0

Since we're operating on the unboxed type GHC.Prim.Int#, all the additions are strict, so we only have one loop through the data.
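
In source-level Haskell, that core corresponds roughly to the following strict accumulator loop (a sketch; the names are mine, and a bang pattern stands in for the unboxed GHC.Prim.Int#):

{-# LANGUAGE BangPatterns #-}

-- One traversal: the running count is threaded through as a
-- strict accumulator, mirroring the single loop in the core above.
count :: (a -> Bool) -> [a] -> Int
count p as = go as 0
  where
    go []     !n = n
    go (y:ys) !n
      | p y       = go ys (n + 1)
      | otherwise = go ys n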

rampion

Either way, you have to perform one or two operations per item: checking the predicate is always necessary, and adding 1 happens only when the predicate holds.

So, setting aside cache effects and the like, both versions perform the same number of operations.

The naive picture is that the first version makes two separate traversals: one gathering the elements, and one counting them. For a list bigger than the cache can hold, this would slow down the processing, and in a strict language it actually does.

However, Haskell's laziness comes into the picture here. The list produced by filter is evaluated (comes into existence) element by element, exactly as the counting function length demands it. And since length uses each element only for counting and keeps no reference to the newly created list, each cell is immediately eligible for garbage collection. So at any time during the calculation, only O(1) extra memory is occupied.
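
A small experiment makes this interleaving visible (a sketch; lengthTraced is my own instrumented stand-in for length, using Debug.Trace):

import Debug.Trace (trace)

-- Counts like length, but announces each step. Running main prints
-- "tested" and "counted" messages interleaved, showing that filter
-- produces each element only when the counting loop demands it.
lengthTraced :: [a] -> Int
lengthTraced = go 0
  where
    go n []     = n
    go n (_:xs) = trace "counted one" (go (n + 1) xs)

main :: IO ()
main = print (lengthTraced (filter keep [1 .. 4 :: Int]))
  where
    keep x = trace ("tested " ++ show x) (even x)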

There is a slight overhead for constructing the resulting filtered list in the first version, but it is usually negligible compared with the cache effects that come in when the lists are large. (For small lists it may matter; this needs to be tested.) Furthermore, it may be optimized away entirely, depending on the compiler and the optimization level chosen.

Update: the second version will actually consume memory, as pointed out in one of the comments. For a fair comparison, you would need to write it with an accumulating parameter and a strictness annotation in the right place (as in the sketch below), because (I expect) length is already written this way.
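
A minimal sketch of such a version (the name countStrict is not from the thread; seq provides the strictness annotation):

-- Tail recursive with an accumulating parameter; seq forces the
-- accumulator at each step, so no chain of suspended additions builds up.
countStrict :: (a -> Bool) -> [a] -> Int
countStrict p = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = if p x then acc + 1 else acc
                    in acc' `seq` go acc' xs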

Abhay
  • Is my understanding correct? `filter` constructs the filtered list one by one as requested by `length`, which in turn deconstructs it. This intermediate construction/deconstruction might be called a "second traversal", but the compiler can optimize it away. – stholzm Apr 26 '15 at 12:31
  • @stholzm Right. I would put it this way: the first and second traversals are interleaved due to laziness. An optimizing compiler can remove the second one, because all it does is construct and deconstruct (and add one to some other number). – Abhay Apr 26 '15 at 12:40

You can test which version is faster with profiling, e.g.:

module Main where


main :: IO ()
main = print countme'

count' :: (a -> Bool) -> [a] -> Int
count' p = length . filter p

count :: (a -> Bool) -> [a] -> Int
count _ [] = 0
count p (x:xs)
   | p x       = 1 + count p xs
   | otherwise =     count p xs


list :: [Integer]  -- mod defaults the element type to Integer
list = [0..93823249]

countme' = count' (\x -> x `mod` 15 == 0) list
countme = count (\x -> x `mod` 15 == 0) list

Then compile with `ghc -O2 -prof -fprof-auto -rtsopts Main.hs` and run `./Main +RTS -p`. It will produce the file `Main.prof`. Then change the main function to use `countme` instead and compare the results. Mine are:

  • 4.12s for the implicit version
  • 6.34s for the explicit version

If you turn off optimization, then the implicit one is still slightly (but not much) faster.

Apart from the fusion and laziness already explained by others, I guess another reason could be that length and filter are Prelude functions and can be better optimized by the compiler.
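
For more precise timings than the profiler gives, a benchmark with the criterion library is an option (a sketch, assuming count and count' from the program above are in scope and criterion is installed):

import Criterion.Main (bench, defaultMain, whnf)

-- Benchmarks both versions on the same input; whnf forces the
-- resulting Int, which is enough to run the whole count.
main :: IO ()
main = defaultMain
  [ bench "length . filter"    $ whnf (count' (\x -> x `mod` 15 == 0)) [0 .. 1000000 :: Int]
  , bench "explicit recursion" $ whnf (count  (\x -> x `mod` 15 == 0)) [0 .. 1000000 :: Int]
  ]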

hasufell