
While playing with Haskell and looking for a solution to Project Euler problem 40, I found that this code is very fast:

import Data.Char (digitToInt)

p = concat [show n | n <- [1..]]
dl x = p !! x

y = [10^a - 1 | a <- [0..6]]
s = [digitToInt (dl b) :: Int | b <- y]

but this one is extremely slow, something like a million times slower:

import Data.Char (digitToInt)

p = foldl1 (++) (map show [1..1000000])
dl x = p !! x

y = [10^a - 1 | a <- [0..6]]
s = [digitToInt (dl b) :: Int | b <- y]

Can anybody explain to me why? Thanks.

hamza B
  • Yes, if you use `foldl1`, you make the concatenation *O(n^2)* instead of *O(n)*; to boost performance, you should use `foldr`. – Willem Van Onsem Dec 02 '18 at 18:07
  • Appending to the end of a list requires the whole list to be reconstructed from scratch. You'd better avoid that by means of some logic: append a long list to the end of a small list as @Willem Van Onsem suggested, or use a [Difference List](https://wiki.haskell.org/Difference_list). – Redu Dec 02 '18 at 18:28

1 Answer


As already mentioned in the comments, your lists are associating in the worst possible way. If we look at the usual (and most sensible) definition of `(++)`, we see

(++) :: [a] -> [a] -> [a]
[]     ++ ys = ys              -- first list exhausted: return the second as-is
(x:xs) ++ ys = x : (xs ++ ys)  -- one new cons cell per element of the first list

(Source)

So `(++)` is O(n), where n is the length of the first list only. It doesn't depend on the length of the second. When you do `foldl1` (or any of the other `foldl` variants), the derivation looks something like...

foldl1 (++) [a, b, c, d]
==> ((a ++ b) ++ c) ++ d

Since only the first argument determines the cost, if we let n be the sum of the lengths of a, b, c, and d, then we iterate over the elements of a three times, those of b twice, and those of c once. So we're doing, roughly, n operations n times, for a total of O(n^2) operations. (You can do the math rigorously: you'll end up with a triangular-number sum, and the triangular-number formula is quadratic, which is where the n^2 comes from. Proof is left as an exercise to the reader.)
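
To see the triangular number concretely, here is a small counting sketch (the names `leftCost` and `rightCost` are made up for illustration) for concatenating k chunks of m characters each:

-- Left-associated: the i-th append re-walks the i*m elements built so far,
-- so the total is m * (1 + 2 + ... + (k-1)), a triangular sum, quadratic in k.
leftCost :: Int -> Int -> Int
leftCost k m = sum [i * m | i <- [1 .. k - 1]]

-- Right-associated: each append walks only its own m-element chunk,
-- so the total is just n = k * m.
rightCost :: Int -> Int -> Int
rightCost k m = k * m

main :: IO ()
main = do
  print (leftCost 1000 6)   -- 2997000
  print (rightCost 1000 6)  -- 6000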

On the other hand, if we use `foldr`, then

foldr (++) [] [a, b, c, d]
==> a ++ (b ++ (c ++ (d ++ [])))

Again, we only look at the left-hand argument every time. In this case, the arguments are organized nicely, so we only iterate over each element once, resulting in O(n) recursive steps.
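
Applying that to the question's slow program is a one-line change; a sketch, keeping the rest of the code as posted:

import Data.Char (digitToInt)

-- Right-associated concatenation: each show-ed chunk is traversed only once.
p = foldr (++) [] (map show [1..1000000])
dl x = p !! x

y = [10^a - 1 | a <- [0..6]]
s = [digitToInt (dl b) :: Int | b <- y]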

Difference lists solve this in a different way, by "magically" reassociating your parentheses to be better placed. You can read more about those in the linked article.
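
For a taste of how they pull that off, here is a minimal hand-rolled sketch (these names are illustrative, not the `dlist` package's API): a list becomes a function that prepends itself to whatever comes next, so appending is function composition and costs O(1) however it associates:

-- A list xs represented as "prepend me": \rest -> xs ++ rest.
type DList a = [a] -> [a]

fromList :: [a] -> DList a
fromList xs = (xs ++)

-- Appending is composition, O(1); no intermediate lists are built.
app :: DList a -> DList a -> DList a
app = (.)

-- Materializing runs every (++) with its own short list on the left,
-- so the whole result comes out in one O(n) pass, e.g.
-- toList ((fromList "ab" `app` fromList "cd") `app` fromList "ef") == "abcdef"
toList :: DList a -> [a]
toList f = f []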

The reason `concat` works better is that the people who wrote Haskell knew about these problems and knew to use `foldr` (or an equivalent) to get the better performance.
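
Indeed, the Haskell 2010 Report specifies `concat` as exactly the right-associated fold used above (GHC's actual implementation is a fusion-friendly equivalent):

concat :: [[a]] -> [a]
concat xss = foldr (++) [] xss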

Silvio Mayolo
  • `(++)` by itself is O(1); it depends on how it's used, e.g. `head $ foldl1 (++) (map show [1..n])` [is linear in `n`](https://stackoverflow.com/a/14942678/849891). – Will Ness Dec 02 '18 at 20:05
  • @WillNess, that's true, but very sensitive to context. In many cases (such as this one), we are quite interested in the total amortized cost of an operation (the amount of work it does plus the amount of work it suspends for later). Sometimes we need to use a more detailed analysis, such as when mapping a function over an infinite list. – dfeuer Dec 02 '18 at 20:13
  • @dfeuer what I'm saying is, we can't say *anything* about a piece of code's cost by itself; we must specify how it is used: being an argument to `head`, to `last`, or to `flip const` makes for three very different use cases. – Will Ness Dec 02 '18 at 20:19
  • @WillNess, I was being a bit sloppy in my last message. Potentially infinite lists are a bit of a weird case, but consider something like sequences. You can get a nice upper bound on the cost of a sequence of `Seq` operations using the advertised amortized bounds of each operation. If you're doing something a bit weird, you may indeed be able to obtain a tighter bound by considering the detailed incremental behavior of functions like `fmap`, `liftA2`, `zipWith`, `<>`, `inits`, and `tails`. – dfeuer Dec 02 '18 at 20:30
  • @dfeuer I'm unfamiliar with `Seq` apart from its surface API. I guess the more strictness there is to a type, the more we can say about its performance upfront. seems logical. – Will Ness Dec 02 '18 at 20:37
  • @WillNess, well then, you should probably read some of the source! Some parts (Louis Wasserman's `replicateA`, `inits`, and `tails` and my `liftA2`, `*>`, `zipWith`, and `chunksOf` in particular) are rather interesting. I challenge you to come up with an incrementally optimal `<*` to fill out the `Applicative` implementation. – dfeuer Dec 03 '18 at 00:08
  • @dfeuer thanks for the suggestions and for the challenge. :) Maybe after the New Years. :) – Will Ness Dec 03 '18 at 06:23
  • @WillNess, when you try it: `xs <* ys` should have a time and space complexity of `O(length xs * log (length ys))`, I believe, and it should be incremental just like `<*>`. Really it should be possible to adapt the ideas of `<*>` and `replicate` fairly directly, but the details twist my brain up in knots. – dfeuer Dec 03 '18 at 18:36
  • Understanding "we only look at the left-hand argument every time" was key for me - what you're saying is that (++) only iterates its left-hand argument. The right-hand argument is just tacked on the end. In the `foldl` case, the left-hand argument in each parenthesis pair ends up getting bigger and bigger. In the `foldr` case it is only ever one of `a`, `b`, `c` or `d`. Remembering that lists are singly-linked lists in Haskell also helps here. – Heath Raftery Nov 22 '21 at 23:20