27

I've seen many examples in functional languages of processing a list and constructing a function to do something with its elements after receiving some additional value (usually not present at the time the function was generated), such as:
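
For instance, a staged list equality built with a fold (a rough sketch; the comments below discuss its exact form):

-- the fold turns one list into a checking function,
-- which is then applied to the other list
listEq :: Eq a => [a] -> [a] -> Bool
listEq a b = foldr comb null b a
  where
    comb x rest = \ys -> case ys of
      (y:ys') -> x == y && rest ys'
      []      -> False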

In all these examples, the authors generally remark on the benefit of traversing the original list only once. But I can't keep myself from thinking "sure, instead of traversing a list of N elements, you are traversing a chain of N evaluations, so what?". I know there must be some benefit to it; could someone explain it, please?


Edit: Thanks to both for the answers. Unfortunately, that's not what I wanted to know. I'll try to clarify my question, so it's not confused with the (more common) one about creating intermediate lists (which I already read about in various places). Also thanks for correcting my post formatting.

I'm interested in the cases where you construct a function to be applied to a list, but you don't yet have the value necessary to evaluate the result (be it a list or not). Then you can't avoid generating references to each list element (even if the list structure itself is not referenced anymore). And you have the same memory accesses as before, except you don't have to deconstruct the list (pattern matching).

For example, see the "staging" chapter in the mentioned ML book. I've tried it in ML and Racket, more specifically the staged version of "append", which traverses the first list and returns a function to insert the second list at its tail, without traversing the first list many times. Surprisingly for me, it was much faster, even considering it still had to copy the list structure, as the last pointer was different in each case.
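
For reference, a direct Haskell transliteration of that staged append (the original is in ML; the name here is mine):

-- builds, from the first list, a function that appends any second list;
-- in a strict language the let-bound recursive call runs at staging
-- time, so the first list is traversed only once, up front
stagedAppend :: [a] -> ([a] -> [a])
stagedAppend []     = \ys -> ys
stagedAppend (x:xs) = let rest = stagedAppend xs
                      in \ys -> x : rest ys

In lazy Haskell the let binding is not forced until the result is demanded, which is why the payoff shows up mainly in strict languages like ML and Racket.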

The following is a variant of map which, after being applied to a list, should be faster when changing the function. As Haskell is not strict, I would have to force the evaluation of listMap [1..100000] in cachedList (or maybe not, as after the first application it should still be in memory).

listMap = foldr comb (const [])
  where comb x rest = \f -> f x : rest f

cachedList = listMap [1..100000]
doubles = cachedList (2*)
squares = cachedList (\x -> x*x)

-- print doubles and squares
main = do
    print doubles
    print squares

I know that in Haskell it doesn't make a difference (please correct me if I'm wrong) whether you write comb x rest f = ... or comb x rest = \f -> ..., but I chose this version to emphasize the idea.

Update: after some simple tests, I couldn't find any difference in execution times in Haskell. The question then is only about strict languages such as Scheme (at least the Racket implementation, where I tested it) and ML.

Will Ness
Alejandro Pulver
  • you sure it's `listEq a b = foldr comb null b a` and not `listEq a b = foldr comb null a b`? Where is it from? – Will Ness Dec 09 '12 at 10:40
  • about staging - I think it allows partial pre-compilation. Such pre-compiled functions embody some part of the work already done, in advance. In particular, the traversing of a list is done only once, all references to individual elements are pulled from it and stored in a compiled function, ready to use. Compiled functions aren't traversed, only interpreted functions would be traversed. – Will Ness Dec 09 '12 at 11:14
  • About the "a b" vs "b a" you're right, it should be the other way around. Actually they can be removed from both sides; the reason I kept them is that ghci produced an error without them (when using "let listEQ = ..."). And if they were in the reverse order, the partial evaluation I was looking for would not occur. – Alejandro Pulver Dec 13 '12 at 22:41
  • "... would not occur" - exactly why I suggested it. You forgot to change it in your last edit though. :) – Will Ness Dec 14 '12 at 18:56
  • a related Haskell link: http://www.haskell.org/haskellwiki/Runtime_compilation , mentioned [here](http://stackoverflow.com/a/898763/849891). Although, in Haskell we can't explicitly distinguish between a curried declaration and actual staging (compilation). Usually, if a *named* entity is declared it will be compiled as a separate memory resident entity; but it's a *compiler* thing (aka implementation detail). – Will Ness Dec 20 '12 at 11:16

4 Answers

27

Executing a few extra arithmetic instructions in your loop body is cheaper than executing a few extra memory fetches, basically.

Traversals mean doing lots of memory accesses, so the fewer you do, the better. Fusing traversals reduces memory traffic and increases the straight-line compute load, so you get better performance.

Concretely, consider this program to compute some math on a list:

go :: [Int] -> [Int]
go = map (+2) . map (^3)

Clearly, we designed it with two traversals of the list. Between the first and the second traversal, the result is stored in an intermediate data structure. However, it is a lazy structure, so it only costs O(1) memory.

Now, the Haskell compiler immediately fuses the two loops into:

go = map ((+2) . (^3))

Why is that? After all, both are O(n) complexity, right? The difference is in the constant factors.

Consider this abstraction: for each step of the first pipeline we do:

  i <- read memory          -- cost M
  j = i ^ 3                 -- cost A
  write memory j            -- cost M
  k <- read memory          -- cost M
  l = k + 2                 -- cost A
  write memory l            -- cost M

so we pay 4 memory accesses, and 2 arithmetic operations.

For the fused result we have:

  i <- read memory          -- cost M
  j = (i ^ 3) + 2           -- cost 2A
  write memory j            -- cost M

where A and M are the constant factors for doing math on the ALU and memory access.

There are other constant factors as well: two loop branches instead of one.

So unless memory access is free (and it is not, by a long shot), the second version is always faster.

Note that compilers that operate on immutable sequences can implement array fusion, the transformation that does this for you. GHC is such a compiler.
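
This fusion is expressed via rewrite rules. A simplified sketch of the kind of rule involved (GHC's actual list fusion goes through foldr/build, but a direct map/map rule would look like this):

{-# RULES
"map/map" forall f g xs. map f (map g xs) = map (f . g) xs
  #-}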

Don Stewart
  • I would gladly give another +1 for expanding an already clear and valid answer for extra clarity. – ljedrz Dec 03 '12 at 20:33
16

There is another very important reason. If you traverse a list only once, and you have no other reference to it, the GC can release the memory claimed by the list elements as you traverse them. Moreover, if the list is generated lazily, you always have only constant memory consumption. For example:

import Data.List

main = do
    let xs = [1..10000000]
        sum = foldl' (+) 0 xs
        len = foldl' (\l _ -> l + 1) 0 xs
    print (sum / len)

computes the sum, but needs to keep the reference to xs, so the memory it occupies cannot be released, because it is needed to compute len later. (Or vice versa.) So the program consumes a considerable amount of memory; the larger xs is, the more memory it needs.

However, if we traverse the list only once, it is created lazily and the elements can be GC'd immediately, so no matter how big the list is, the program takes only O(1) memory:

{-# LANGUAGE BangPatterns #-}
import Data.List

main = do
    let xs = [1..10000000]
        (sum, len) = foldl' (\(!s,!l) x -> (s + x, l + 1)) (0, 0) xs
    print (sum / len)
Petr
  • Interestingly, you can use sparks to compute `sum` and `len` in parallel, also avoiding O(n) space (see the sketch below this thread). – Don Stewart Dec 03 '12 at 20:50
  • @DonStewart Good point. I'd like to ask, can you rely on it? I mean, can you be sure that one spark won't get behind another one for some reason (or won't be started soon enough) and the gap will eventually take *O(n)* space? – Petr Dec 03 '12 at 20:53
  • I think the operational behavior is non-deterministic, so you do not have guarantees of parallel progress. – Don Stewart Dec 03 '12 at 20:56
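
A minimal sketch of the spark idea from this thread, assuming the parallel package's Control.Parallel (illustrative only; as noted above, parallel progress is not guaranteed, so neither is the O(1) space):

import Control.Parallel (par, pseq)
import Data.List (foldl')

main :: IO ()
main = do
    let xs = [1..10000000] :: [Double]
        sum = foldl' (+) 0 xs
        len = foldl' (\l _ -> l + 1) 0 xs
    -- spark the sum while evaluating the length; needs -threaded
    -- and +RTS -N for the spark to actually run in parallel
    sum `par` len `pseq` print (sum / len)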
3

Sorry in advance for a chatty-style answer.

That's probably obvious, but if we're talking about performance, you should always verify hypotheses by measuring.

A couple of years ago I was thinking about the operational semantics of GHC, the STG machine. And I asked myself the same question: surely the famous "one-traversal" algorithms are not that great? It only looks like one traversal on the surface; under the hood you also have this chain-of-thunks structure, which is usually quite similar to the original list.

I wrote a few versions (varying in strictness) of the famous RepMin problem: given a tree filled with numbers, generate a tree of the same shape, but with every number replaced by the minimum of all the numbers. If my memory is right (remember: always verify stuff yourself!), the naive two-traversal algorithm performed much faster than the various clever one-traversal algorithms, one of which is sketched below.
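
For reference, the classic lazy one-traversal ("circular") version looks roughly like this (a sketch; the names are mine):

data Tree = Leaf Int | Node Tree Tree

repMin :: Tree -> Tree
repMin t = t'
  where
    -- m is the overall minimum; the rebuilt tree refers to it lazily,
    -- before it has been fully computed
    (m, t') = go t
    go (Leaf n)   = (n, Leaf m)
    go (Node l r) = let (ml, l') = go l
                        (mr, r') = go r
                    in  (min ml mr, Node l' r')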

I also shared my observations with Simon Marlow (we were both at an FP summer school at the time), and he said that they use this approach in GHC. But not to improve performance, as you might have thought. Instead, he said, for a big AST (such as Haskell's), writing out all the constructors takes a lot of space (in terms of lines of code), so they simply reduce the amount of code by writing down just one (syntactic) traversal.

Personally, I avoid this trick because if you make a mistake, you get an infinite loop, which is a very unpleasant thing to debug.

Roman Cheplyaka
2

So the answer to your question is: partial compilation. Done ahead of time, it makes it so that there's no need to traverse the list to get to the individual elements; all the references are found in advance and stored inside the pre-compiled function.

As to your concern about the need for that function to be traversed too, it would be true in interpreted languages. But compilation eliminates this problem.

In the presence of laziness this coding trick may lead to the opposite results. Having full equations, a Haskell compiler like GHC is able to perform all kinds of optimizations, which essentially eliminate the lists completely and turn the code into an equivalent of loops. This happens when we compile the code with e.g. the -O2 switch.

Writing out the partial equations may prevent this compiler optimization and force the actual creation of functions, with a drastic slowdown of the resulting code. I tried your cachedList code and saw a 0.01s execution time turn into 0.20s (I don't remember right now the exact test I did).
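
To make the distinction concrete, here are the two forms of the comb from your question side by side (behavior is identical; the difference is only in what the compiler sees, and whether it matters depends on the compiler and flags, as the timings above suggest):

-- full equation: all arguments on the left-hand side,
-- so the compiler sees a three-argument function
combFull :: a -> ((a -> b) -> [b]) -> (a -> b) -> [b]
combFull x rest f = f x : rest f

-- partial equation: the explicit lambda nudges the compiler
-- toward actually building intermediate one-argument closures
combPartial :: a -> ((a -> b) -> [b]) -> (a -> b) -> [b]
combPartial x rest = \f -> f x : rest f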

Will Ness
  • When you say "partial compilation", I assume this isn't done at run-time (as opposed to some dynamic languages having a JIT). So you mean it only works when the list is known at compile-time, or may it even optimize by reusing a list known only at run-time? – Alejandro Pulver Dec 17 '12 at 16:43
  • @user1871626 there are systems capable of run-time compilation, like e.g. Common LISP. It is done explicitly there: having a function `fun`, you can call `(compile 'fun)` (at the REPL prompt, say) and it gets compiled, and the compiled code gets linked into the run-time. The staging technique would fit there nicely. Having been compiled, the function embodies pre-processed knowledge of the list, so applying the compiled function (several times) is more efficient, as the book you linked to explains. – Will Ness Dec 17 '12 at 20:32