11

I'm writing some code that needs to frequently append to the end of a list. I know that using "++" is inefficient. So instead I build up the list backwards by appending to the head, and then reverse it when I'm done. I gather that this a common beginner tactic.

I would rather build it up in the correct order to begin with - but that means switching to a new data structure. I'm considering using Data.Sequence or Data.DList for my container. My list consists of strict int pairs, and I don't need random access to it. What are the relative merits of Data.Sequence and Data.DList, and are there other containers I should consider?

nont
  • 9,322
  • 7
  • 62
  • 82
  • 1
    I would use `Data.Sequence`, just because it has no external dependecies except `containers`, which is shipped with the standard platform. – fuz Jun 13 '11 at 13:58
  • 2
    @FUZxxl `DList` implementation is so trivial that it can be used without external dependencies (like `type DList a = [a] -> [a]`, `append = (.)`, `toList = ($[])`, `fromList = (++)` and so on). – Rotsor Jun 13 '11 at 15:51
  • Maybe you could post some of the code to let us know, what kind of solution you need. – fuz Jun 13 '11 at 16:05
  • See also http://stackoverflow.com/questions/3352418/what-is-a-dlist/6317209#6317209 for the origins of dlists. – Don Stewart Jun 13 '11 at 16:55

2 Answers2

27

Whether to use Data.Sequence or DList depends on how you are going to be using the resulting list. DList is great when you are building up a sequence, say in a Writer computation, to convert to a list at the end and use it. However, if you need to use the intermediate results, like, say:

f (foo ++ bar)
+ f (foo ++ bar ++ baz)
+ f (foo ++ bar ++ baz ++ quux)

then DList is pretty bad, because it needs to recompute the spine each time. Data.Sequence is a better choice in this situation. Data.Sequence is also better if you need to remove elements from the sequence.

But maybe you don't even need to make this decision. Reversing lists at the end of a computation is common in strict functional languages like ML and Scheme, but not in Haskell. Take, for example, these two ways of writing map:

map_i f xs = reverse $ go [] xs
    where
    go accum [] = accum
    go accum (x:xs) = go (f x : accum) xs

map_ii f [] = []
map_ii f (x:xs) = f x : map_ii f xs

In a strict language, map_ii would be horrible because it uses linear stack space, whereas map_i is tail recursive. But because Haskell is lazy, map_i is the inefficient one. map_ii can consume one element of the input and yield one element of the output, whereas map_i consumes the whole input before yielding any output.

Tail recursion isn't the holy grail of efficient implementation in Haskell. When producing a data structure like a list, you actually want to be co-recursive; that is, make the recursive call underneath an application of a constructor (eg. f x : map_ii f xs above).

So if you find yourself reversing after a tail-recursive function, see if you can factor the whole lot into a corecursive function.

Pi Delport
  • 10,356
  • 3
  • 36
  • 50
luqui
  • 59,485
  • 12
  • 145
  • 204
  • 1
    Very helpful answer, Thanks! I am just converting to a list at the end, and I don't need intermediate results. So it sounds like DList is the way for me to go in this case. – nont Jun 13 '11 at 15:15
  • This great answer is the most straightforward definition of corecursion I've seen. Thanks! – acfoltzer Jun 15 '11 at 05:44
  • 1
    Can you also mention in what situations would `DList` be a better choice? – kizzx2 Aug 04 '11 at 13:47
  • @kizzx2, see first paragraph. In the situations mentioned there, `Data.Sequence` is a waste. – luqui Aug 04 '11 at 16:20
1

I did a simple criterion comparison:

let mdata = replicate 1000 (replicate 100 (13 :: Int))
let f1 = foldl' (++) []
let fr = foldr (++) []
let f2 = foldl' (\s next -> s Seq.|> next) mempty
let f3 = foldl' (DL.snoc) mempty

defaultMain [
    bgroup "time encode" [ bench "++ - foldl"   $ nf (AE.encode . f1) mdata
                         , bench "++ - foldr"   $ nf (AE.encode . fr) mdata
                         , bench "Seq |>"  $ nf (AE.encode . toList . f2) mdata
                         ,  bench "DList snoc"  $ nf (AE.encode . DL.toList . f3) mdata]
  ]

With the result:

benchmarking time encode/++ - foldl
time                 604.1 ms   (570.0 ms .. 654.6 ms)
                     0.999 R²   (NaN R² .. 1.000 R²)
mean                 619.2 ms   (608.9 ms .. 625.6 ms)
std dev              9.761 ms   (0.0 s .. 11.21 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking time encode/++ - foldr
time                 2.554 ms   (2.503 ms .. 2.610 ms)
                     0.995 R²   (0.990 R² .. 0.998 R²)
mean                 2.584 ms   (2.547 ms .. 2.628 ms)
std dev              134.1 μs   (111.7 μs .. 186.6 μs)
variance introduced by outliers: 34% (moderately inflated)

benchmarking time encode/Seq |>
time                 2.020 ms   (1.986 ms .. 2.049 ms)
                     0.997 R²   (0.995 R² .. 0.998 R²)
mean                 2.106 ms   (2.078 ms .. 2.138 ms)
std dev              105.8 μs   (85.60 μs .. 130.5 μs)
variance introduced by outliers: 34% (moderately inflated)

benchmarking time encode/DList snoc
time                 1.992 ms   (1.952 ms .. 2.034 ms)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 2.030 ms   (2.000 ms .. 2.060 ms)
std dev              97.88 μs   (82.60 μs .. 122.3 μs)
variance introduced by outliers: 34% (moderately inflated)

Conclusion: Use Data.Sequence. It has the most flexibility while just a notch lower performance than DList - it's probably not worth it.

ondra
  • 9,122
  • 1
  • 25
  • 34