
In the "programming tips" section of the haskell wiki, I found this example:

count :: (a -> Bool) -> [a] -> Int
count p = length . filter p

This was said to be a better alternative to

count :: (a -> Bool) -> [a] -> Int
count _ [] = 0
count p (x:xs)
   | p x       = 1 + count p xs
   | otherwise =     count p xs

In terms of readability, I entirely agree.

However, isn't that a double traversal, and therefore actually worse than the explicit-recursion function? Does laziness in GHC mean that this is equivalent to a single traversal after optimisation? Which implementation is faster, and why?

AJF
  • The two traversals will likely be fused into one when you turn on optimisations in GHC. But yes, in principle you're right, though the second traversal is only over the shorter, filtered list, so it would often not matter much anyway. – leftaroundabout Apr 26 '15 at 11:09
  • The second version is not tail recursive and will probably lead to more memory being used than the first. You can use a (strict) accumulator instead to keep it in constant memory, or you could use `foldl' (\c a -> if p a then succ c else c) 0` (see the sketch below). – chi Apr 26 '15 at 11:26
  • You might like the [guide to lazy evaluation](https://hackhands.com/guide-lazy-evaluation-haskell/) :D – Random Dev Apr 26 '15 at 12:57
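
A runnable sketch of the strict accumulator chi suggests (the name `countAcc` is not from the thread):

import Data.List (foldl')

-- foldl' forces the accumulator at every step, so the count
-- runs in constant space with a single traversal of the list.
countAcc :: (a -> Bool) -> [a] -> Int
countAcc p = foldl' (\c a -> if p a then succ c else c) 0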

3 Answers


So to see what the optimizer actually does, let's look at the core:

% ghc -O2 -ddump-simpl Temp.hs
[1 of 1] Compiling Temp             ( Temp.hs, Temp.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 29, types: 26, coercions: 0}

Temp.count
  :: forall a_arN.
     (a_arN -> GHC.Types.Bool) -> [a_arN] -> GHC.Types.Int
[GblId,
 Arity=2,
 Caf=NoCafRefs,
 Str=DmdType <L,C(U)><S,1*U>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=2, Value=True,
         ConLike=True, WorkFree=True, Expandable=True,
         Guidance=IF_ARGS [60 0] 191 0}]
Temp.count =
  \ (@ a_aMA)
    (p_arV :: a_aMA -> GHC.Types.Bool)
    (eta_B1 :: [a_aMA]) ->
    letrec {
      go_aNr [Occ=LoopBreaker]
        :: [a_aMA] -> GHC.Prim.Int# -> GHC.Types.Int
      [LclId, Arity=1, Str=DmdType <S,1*U>]
      go_aNr =
        \ (ds_aNs :: [a_aMA]) ->
          case ds_aNs of _ [Occ=Dead] {
            [] -> GHC.Types.I#;
            : y_aNx ys_aNy ->
              case p_arV y_aNx of _ [Occ=Dead] {
                GHC.Types.False -> go_aNr ys_aNy;
                GHC.Types.True ->
                  let {
                    g_a10B [Dmd=<L,C(U)>] :: GHC.Prim.Int# -> GHC.Types.Int
                    [LclId, Str=DmdType]
                    g_a10B = go_aNr ys_aNy } in
                  \ (x_a10C :: GHC.Prim.Int#) -> g_a10B (GHC.Prim.+# x_a10C 1)
              }
          }; } in
    go_aNr eta_B1 0

Cleaning it up a bit:

Temp.count :: forall aType.  (aType -> Bool) -> [aType] -> Int
Temp.count = \(@ aType) (p :: aType -> Bool) (as :: [aType]) ->
  letrec {
    go :: [aType] -> GHC.Prim.Int# -> Int
    go = \(xs :: [aType]) ->
      case xs of _ {
        [] -> I#; -- constructor to make a GHC.Prim.Int# into an Int
        : y ys ->
          case p y of _ {
            False -> go ys;
            True ->
              let {
                go' :: GHC.Prim.Int# -> Int
                go' = go ys 
              } in \(x :: GHC.Prim.Int#) -> go' (GHC.Prim.+# x 1)
          }
      }; 
  } in go as 0

Since we're operating on the unboxed type GHC.Prim.Int#, all the additions are strict, so we only have one loop through the data.
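
In source-level Haskell, that core corresponds roughly to the following strict accumulator loop (a sketch; the names are mine, and a bang pattern stands in for the unboxed GHC.Prim.Int#):

{-# LANGUAGE BangPatterns #-}

-- One traversal: the running count is threaded through as a
-- strict accumulator, mirroring the single loop in the core above.
count :: (a -> Bool) -> [a] -> Int
count p as = go as 0
  where
    go []     !n = n
    go (y:ys) !n
      | p y       = go ys (n + 1)
      | otherwise = go ys n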

rampion

Either way, you have to perform one or two operations per item: checking the predicate is always necessary, and adding 1 happens only when the predicate holds.

So, setting aside cache effects and the like, both versions perform the same number of operations.

The naive picture is that the first version makes two separate traversals: one gathering the elements, and one counting them. For a list bigger than the cache can hold, this would slow down the processing, and in a strict language it actually does.

However, Haskell's laziness comes into the picture here. The list produced by filter is evaluated (comes into existence) element by element, exactly as the counting function length demands it. And since length uses each element only for counting and keeps no reference to the newly created list, each cell is immediately eligible for garbage collection. So at any time during the calculation, only O(1) extra memory is occupied.
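
A small experiment makes this interleaving visible (a sketch; lengthTraced is my own instrumented stand-in for length, using Debug.Trace):

import Debug.Trace (trace)

-- Counts like length, but announces each step. Running main prints
-- "tested" and "counted" messages interleaved, showing that filter
-- produces each element only when the counting loop demands it.
lengthTraced :: [a] -> Int
lengthTraced = go 0
  where
    go n []     = n
    go n (_:xs) = trace "counted one" (go (n + 1) xs)

main :: IO ()
main = print (lengthTraced (filter keep [1 .. 4 :: Int]))
  where
    keep x = trace ("tested " ++ show x) (even x)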

There is a slight overhead for constructing the resulting filtered list in the first version, but it is usually negligible compared with the cache effects that come in when the lists are large. (For small lists it may matter; this needs to be tested.) Furthermore, it may be optimized away entirely, depending on the compiler and the optimization level chosen.

Update: the second version will actually consume memory, as pointed out in one of the comments. For a fair comparison, you would need to write it with an accumulating parameter and a strictness annotation in the right place (as in the sketch below), because (I expect) length is already written this way.
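
A minimal sketch of such a version (the name countStrict is not from the thread; seq provides the strictness annotation):

-- Tail recursive with an accumulating parameter; seq forces the
-- accumulator at each step, so no chain of suspended additions builds up.
countStrict :: (a -> Bool) -> [a] -> Int
countStrict p = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = if p x then acc + 1 else acc
                    in acc' `seq` go acc' xs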

Abhay
  • Is my understanding correct? `filter` constructs the filtered list one by one as requested by `length`, which in turn deconstructs it. This intermediate construction/deconstruction might be called a "second traversal", but the compiler can optimize it away. – stholzm Apr 26 '15 at 12:31
  • @stholzm Right. I would put it this way: the first and second traversals are interleaved due to laziness. An optimizing compiler can remove the second one, because all it does is construct and deconstruct (and add one to some other number). – Abhay Apr 26 '15 at 12:40

You can test which version is faster with profiling, e.g.:

module Main where


main :: IO ()
main = print countme'

count' :: (a -> Bool) -> [a] -> Int
count' p = length . filter p

count :: (a -> Bool) -> [a] -> Int
count _ [] = 0
count p (x:xs)
   | p x       = 1 + count p xs
   | otherwise =     count p xs


list :: [Integer]  -- mod defaults the element type to Integer
list = [0..93823249]

countme' = count' (\x -> x `mod` 15 == 0) list
countme = count (\x -> x `mod` 15 == 0) list

Then compile with `ghc -O2 -prof -fprof-auto -rtsopts Main.hs` and run `./Main +RTS -p`. It will produce the file `Main.prof`. Then change the main function to use `countme` instead and compare the results. Mine are:

  • 4.12s for the implicit version
  • 6.34s for the explicit version

If you turn off optimization, then the implicit one is still slightly (but not much) faster.

Apart from the fusion and laziness already explained by others, I guess another reason could be that length and filter are Prelude functions and can be better optimized by the compiler.
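
For more precise timings than the profiler gives, a benchmark with the criterion library is an option (a sketch, assuming count and count' from the program above are in scope and criterion is installed):

import Criterion.Main (bench, defaultMain, whnf)

-- Benchmarks both versions on the same input; whnf forces the
-- resulting Int, which is enough to run the whole count.
main :: IO ()
main = defaultMain
  [ bench "length . filter"    $ whnf (count' (\x -> x `mod` 15 == 0)) [0 .. 1000000 :: Int]
  , bench "explicit recursion" $ whnf (count  (\x -> x `mod` 15 == 0)) [0 .. 1000000 :: Int]
  ]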

hasufell