
Haskell beginner here: Inorder traversal of a binary tree is simple, e.g.:

data IntegerTree = Leaf Integer
                 | Node IntegerTree Integer IntegerTree

inorder :: IntegerTree -> [Integer]
inorder (Leaf n)     = [n]
inorder (Node l n r) = inorder l ++ [n] ++ inorder r

However, it seems to me that there must be a more efficient implementation. Since lists are singly-linked, concatenating inorder l and [n] seems wasteful, especially since this operation is performed many times for a large tree. Can I avoid this problem by writing the same function differently?

I initially thought about this while trying to solve the Towers of Hanoi puzzle, where a move list is constructed in a similar fashion, and I expect that many problems can be solved with similar recursive algorithms.

Peter
  • Except for heavily left-biased trees, I think `inorder l ++ (n : inorder r)` would go a long ways toward avoiding the worst-case behavior for `(++)`. – chepner Dec 22 '20 at 18:53
  • As far as I've been able to determine, `ghc` always optimizes `l ++ [x] ++ r` to `l ++ x:r`, even at `-O0`, so I've gotten into the habit of writing the former, since it's usually clearer to read. – K. A. Buhr Dec 22 '20 at 19:53

2 Answers


Nested ++, like the ones in your in-order traversal, can be inefficient. This is because list1 ++ list2 copies (the spine of) list1, as follows:

(1:2:3:[]) ++ list2
=
1: ((2:3:[]) ++ list2)
=
1:2: ((3:[]) ++ list2)
=
1:2:3: ([] ++ list2)
=
1:2:3:list2

Copying the first list might not be so bad if done once, but when we nest ++ as in

((list1 ++ list2) ++ list3) ++ list4

we copy the copy of the copy, which slows everything down, often turning what should be O(N) into O(N^2).
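To make the effect of association concrete, here is a small sketch (not part of the original answer): folding (++) to the left produces exactly the nested shape above, re-traversing the accumulated prefix at every step, while folding to the right traverses each list only once. Both produce the same result; only the cost differs.

```haskell
-- Left-nested appends: ((l1 ++ l2) ++ l3) ++ l4 ...  O(N^2) overall.
leftNested :: [[a]] -> [a]
leftNested = foldl (++) []

-- Right-nested appends: l1 ++ (l2 ++ (l3 ++ l4)) ...  O(N) overall.
rightNested :: [[a]] -> [a]
rightNested = foldr (++) []

main :: IO ()
main = print (leftNested chunks == rightNested chunks)  -- prints True
  where chunks = [[i] | i <- [1 .. 1000 :: Int]]
```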

When computing list1 ++ list2, a key idea is this one: if only we could keep a "pointer" to the final [] inside list1 and simply rewrite it to point to list2, we could avoid the copy and obtain constant-time (destructive) append.

Now, do we have imperative-style mutability in Haskell? Not for regular lists. We could, however, transform our lists into "functions of the end", i.e. instead of writing

1:2:3:[]

for a list, we could write

\end -> 1:2:3:end

to represent the same data. The latter representation of lists is called a "difference list". Converting from regular to difference lists is simple:

type DList a = [a] -> [a]

toDList :: [a] -> DList a
toDList = (++)

fromDList :: DList a -> [a]
fromDList dl = dl []
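As a quick sanity check (a small sketch, not part of the original answer): the round trip through a difference list is the identity, since toDList xs partially applies (++) to xs, and fromDList then appends the empty list.

```haskell
type DList a = [a] -> [a]

toDList :: [a] -> DList a
toDList = (++)

fromDList :: DList a -> [a]
fromDList dl = dl []

main :: IO ()
main = print (fromDList (toDList [1, 2, 3 :: Int]))  -- prints [1,2,3]
```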

So far so good, but how do we concatenate two DList a values? We need to take something like

list1 = \end -> 1:2:3:end
list2 = \end -> 4:5:6:7:end

and return

concatenate list1 list2 = \end -> 1:2:3:4:5:6:7:end

It turns out that concatenate is simply function composition, (.). I.e. list1 . list2 is exactly the concatenation we need. When we evaluate

fromDList (list1 . list2)
-- i.e.
(list1 . list2) []

no copy is done, since the end of list1 is immediately linked to list2.
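Putting the pieces above together, here is a runnable sketch of the claim that difference-list concatenation is just (.), using the two example lists from above:

```haskell
type DList a = [a] -> [a]

fromDList :: DList a -> [a]
fromDList dl = dl []

list1, list2 :: DList Int
list1 = \end -> 1 : 2 : 3 : end
list2 = \end -> 4 : 5 : 6 : 7 : end

main :: IO ()
main = print (fromDList (list1 . list2))  -- prints [1,2,3,4,5,6,7]
```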

So, with this in mind, we can rewrite your code with minimal changes:

inorder :: IntegerTree -> [Integer]
inorder tree = fromDList (inorderDL tree)

inorderDL :: IntegerTree -> DList Integer
inorderDL (Leaf n)     = (n :)               -- that is, \end -> n: end
inorderDL (Node l n r) = inorderDL l . (n :) . inorderDL r

This makes no list copies: as each sublist is generated at each recursive step, its tail is not [] but the "pointer" to the next list to concatenate.
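For completeness, here is the rewritten traversal run on a small hypothetical tree (the tree itself is not from the question, just an illustration):

```haskell
data IntegerTree = Leaf Integer
                 | Node IntegerTree Integer IntegerTree

type DList a = [a] -> [a]

fromDList :: DList a -> [a]
fromDList dl = dl []

inorderDL :: IntegerTree -> DList Integer
inorderDL (Leaf n)     = (n :)
inorderDL (Node l n r) = inorderDL l . (n :) . inorderDL r

inorder :: IntegerTree -> [Integer]
inorder = fromDList . inorderDL

-- A sample tree:    2
--                  / \
--                 1   4
--                    / \
--                   3   5
sample :: IntegerTree
sample = Node (Leaf 1) 2 (Node (Leaf 3) 4 (Leaf 5))

main :: IO ()
main = print (inorder sample)  -- prints [1,2,3,4,5]
```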

Finally, note that this approach, under a different light, is exactly what Willem used in his answer.

chi
  • there is no copying if the list is used ephemerally. still it will be quadratic on left-skewed degenerate trees. this [answer to "Why are difference lists more efficient than regular concatenation"](https://stackoverflow.com/a/13879693/849891) gives an explanation. the Richard Bird's trick (seen in the other answer), as also the DList approach in this answer (or even replacing the `++`s with [explicitly manipulated data structure](https://stackoverflow.com/a/43945275/849891)), all actually rediscover the old [`gopher`](http://www.dreamsongs.com/10ideas.html) technique from John McCarthy. – Will Ness Dec 22 '20 at 23:08
  • @WillNess Uhm, the answer you cite says that the total cost is `O(number of lists + sum (map length lists))` which is O(N) in this example, no matter what the tree is. Am I missing something here? The answer does say that, if you encapsulate _arbitrary functions_ `[a]->[a]` in DLists then you can get a quadratic performance back, e.g. if we encapsulate `(++ list)` instead of `(list ++)`, but that does not apply here, AFAICS (?) – chi Dec 22 '20 at 23:31
  • yes of course O(N) is correct (for DLs that is, it is quadratic for the ++ trees), it just explains *why* with great attention to detail instead of referring to "copying" (which is pretty vague under lazy evaluation). the key insight being the restructuring of the (.) tree inside a DL into the ($) list after the first application of the composed DL to the `[]` sentinel (which I mention in my answer; this is how I understood it). re the potential quadratic behaviour, I don't think that applies here, yes. – Will Ness Dec 22 '20 at 23:37
  • i.e. what the answer shows is that the problem with the left-nested `++` tree is not that `++` copies lists, it copies nothing because Haskell is lazy; the problem is that its structure is fixed and so `head` after `tail` is again an O(N) operation, but with the DL the `tail` has rotated the structure to the right so that `head` after `tail` is O(1) (and same with explicit data and explicit rotations, as suggested in my linked answer). – Will Ness Dec 23 '20 at 07:00

You can pass an extra parameter that represents the tail: the list of items that come after the current subtree and still have to be generated. This then looks like:

inorder :: IntegerTree -> [Integer]
inorder = go []
    where go tl (Leaf n) = n : tl
          go tl (Node l n r) = go (n : go tl r) l

Here tl is thus the list of elements that follow after the end of the current subtree. At the top level, that list is empty (hence the go []). When we see a leaf, we emit the item n wrapped in the Leaf, followed by the tail.

For a Node we recurse on the left subtree with n : go tl r as the tail: that node's item, followed by the recursion on the right subtree, which in turn uses the given tail tl.
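The accumulator version above can be checked on a small hypothetical tree (the tree is an illustration, not from the question); every element is consed exactly once, with no (++) at all:

```haskell
data IntegerTree = Leaf Integer
                 | Node IntegerTree Integer IntegerTree

inorder :: IntegerTree -> [Integer]
inorder = go []
    where go tl (Leaf n)     = n : tl
          go tl (Node l n r) = go (n : go tl r) l

-- A sample tree:    2
--                  / \
--                 1   4
--                    / \
--                   3   5
sample :: IntegerTree
sample = Node (Leaf 1) 2 (Node (Leaf 3) 4 (Leaf 5))

main :: IO ()
main = print (inorder sample)  -- prints [1,2,3,4,5]
```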

Willem Van Onsem
  • Okay, I think I get it. Very clear explanation, thanks. – Peter Dec 22 '20 at 18:57
  • N.B. this pattern is known as a "difference list", and is available (as a newtype that "feels" like a list) in the [dlist](https://hackage.haskell.org/package/dlist) package @Peter – luqui Dec 23 '20 at 01:37