Efficient Folding Of A Sum Tree

Question

In this earlier question, I asked how to write a function that sums a non-binary integer tree, and several answers arose.

@Sibi said:

data Tree a = Empty | Node a [Tree a] deriving (Eq, Show)

addNums :: (Num a) => Tree a -> a
addNums Empty = 0
addNums (Node n []) = n
addNums (Node n (x:xs)) = n + (addNums x) + addNums (Node 0 xs)

@user3237465 said:

data Tree a = Empty | Node a [Tree a] deriving (Eq, Show, Foldable)

myNums :: (Num a) => Tree a
myNums = ...

main = print $ sum myNums

and @chi said:

addNums :: (Num a) => Tree a -> a
addNums Empty = 0
addNums (Node n xs) = n + sum (map addNums xs)

How does one go about finding the most efficient solution? Is there a native benchmarking tool in Haskell?

Why don't you remove the redundant `Empty` constructor as suggested in your previous question? It's annoying. — dfeuer, Dec 22 '15 at 03:08
@chi suggested that it had some value, as it was necessary to represent an empty tree. I'm just trying to cover all the bases. — Mark Karavan, Dec 22 '15 at 03:13
Unless you have a particular reason for wanting redundancy (and I don't think that *particular* redundancy has any value), the thing to do is `data Tree a = Node a [Tree a]` (which you can import from `Data.Tree`) and then either `data Tree' a = Empty | NonEmpty (Tree a)` or `newtype Tree' a = Tree' (Maybe (Tree a))`, or possibly even `type Tree' a = Maybe (Tree a)`. — dfeuer, Dec 22 '15 at 04:07
My advice: using `Data.Tree`, you probably can't do much better than `sumTree = foldl' (+) 0 . flatten`. You can improve that a bit, though, by manually fusing the operations. — dfeuer, Dec 22 '15 at 18:23
user3237465 and I had an [optimization battle](http://stackoverflow.com/questions/34406541/efficient-folding-of-a-sum-tree/34407525?noredirect=1#comment56612575_34407525). In cases of code sharing, Church encoding seems more efficient, although it's sort of indeterminate otherwise. Just something to consider. — PyRulez, Dec 24 '15 at 18:39

epsilonhalbe · Answer 1 · 2015-12-22T23:23:18.520

Although so.com is no site for recommendations I advise you to take a look at criterion https://hackage.haskell.org/package/criterion

I'll maybe give some example of it's usage tomorrow

If you really want to dive deep in that matter you can analyze the generated llvm assembler by adding the compiler option --ddump-llvm though that is a rather advanced topic only included for the sake of completeness.

Update - How to use `criterion` in this case

First of all I will explain this using the haskell stack tool, all of the code can be found at github/epsilonhalbe

First of all we create a project and split each of the relevant definitions in a separate module (otherwise we would need data Tree, data Tree' and data Tree''). See Chi.hs as an example:

module Chi where

data Tree a = Empty | Node a [Tree a] deriving (Eq, Show)

addNums :: (Num a) => Tree a -> a
addNums Empty = 0
addNums (Node n xs) = n + sum (map addNums xs)

myInts :: Tree Int
myInts =
    Node 1 [
           Node 2 [
             Node 4 [Empty], Node 5 [Empty]
           ],
           Node 3 [
             Node 6 [Empty], Node 7 [Empty], Node 8 [Empty]
           ]
        ]

myDouble :: Tree Double
myDouble =
    Node 1 [
           Node 2 [
             Node 4 [Empty], Node 5 [Empty]
           ],
           Node 3 [
             Node 6 [Empty], Node 7 [Empty], Node 8 [Empty]
           ]
        ]

Note: that for User3237465.hs we need a Language Pragma

{-# LANGUAGE DeriveFoldable #-}
module User3237465 where

data Tree a = Empty | Node a [Tree a] deriving (Eq, Show, Foldable)

addNums :: Num a => Tree a -> a
addNums = sum

myInts ..
myDouble ..

We build a Folder/File structure like the following ( this we get with stack new critExample and a bit of copying/renaming/deleting)

../haskell/critExample/
▾ src/
    Chi.hs
    Sibi.hs
    User3237465.hs
▾ bench/
    Benchmarks.hs
  critExample.cabal
  LICENSE
  Setup.hs
  stack.yaml

the contents of critExample.cabal also needs some adjustment,

name:                critExample
[... non-important stuff ...]

library
  hs-source-dirs:      src
  -- don't forget to adjust the exposed modules
  exposed-modules:     Chi
                 ,     Sibi
                 ,     User3237465
  build-depends:       base >= 4.7 && < 5
  default-language:    Haskell2010

-- and add the following benchmark part
benchmark addNums
  type:                exitcode-stdio-1.0
  hs-source-dirs:      bench
  main-is:             Benchmarks.hs
  build-depends:       base
                     , critExample
                     , criterion
  default-language:    Haskell2010
  [...]

then we can begin to write our benchmarks

Benchmarks.hs

module Main where

import Criterion
import Criterion.Main

import qualified Chi
import qualified Sibi
import qualified User3237465


main :: IO ()
main = defaultMain [
    bgroup "myInts" [ bench "Sibi"        $ whnf Sibi.addNums Sibi.myInts
                    , bench "Chi"         $ whnf Chi.addNums Chi.myInts
                    , bench "User3237465" $ whnf User3237465.addNums User3237465.myInts
                    ],

    bgroup "myDouble" [ bench "Sibi"        $ whnf Sibi.addNums Sibi.myDouble
                      , bench "Chi"         $ whnf Chi.addNums Chi.myDouble
                      , bench "User3237465" $ whnf User3237465.addNums User3237465.myDouble ]
    ]

Note that whnf only evaluates to weak head normal form, i.e. to the first constructor it sees - for a list this would be after the first element when it sees the (:) operator for tuples it wouldn't evaluate a thing, but for Int or Double it fully evaluates stuff. If you need 'deep' evaluation use nf instead of whnf - if you are not sure what is needed, try both whnf is usually unreasonably fast (like nanoseconds for ultra-long lists - as it only checks the head of that list).

You can build the project with stack build and then invoke the benchmarks with stack bench (triggers all available benchmarks) or stack bench critExample:addNums (useful if you have more than one benchmark suites and only want to run a specific one), usage is always projectname:name of benchmarks given in cabal-file.

If you want fancy html output (- and believe me you want it, because bryan o'sullivan put a lot of effort in it to make it sexy) you'll have to:

./.stack-work/dist/x86_64-linux/Cabal-1.22.4.0/build/addNums/addNums --output index.html

of course this path might vary if you do not use a linux operating system.

Update2

The results of the benchmarks - I don't know how representative they are - I ran them in a virtualized linux!

Running 1 benchmarks...
Benchmark addNums: RUNNING...
benchmarking myInts/Sibi
time                 616.7 ns   (614.1 ns .. 619.2 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 619.1 ns   (615.4 ns .. 626.8 ns)
std dev              17.09 ns   (9.625 ns .. 31.62 ns)
variance introduced by outliers: 38% (moderately inflated)

benchmarking myInts/Chi
time                 582.6 ns   (576.5 ns .. 592.1 ns)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 586.2 ns   (581.5 ns .. 595.5 ns)
std dev              21.14 ns   (11.56 ns .. 33.61 ns)
variance introduced by outliers: 52% (severely inflated)

benchmarking myInts/User3237465
time                 606.5 ns   (604.9 ns .. 608.2 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 607.0 ns   (605.5 ns .. 609.2 ns)
std dev              5.915 ns   (3.992 ns .. 9.798 ns)

benchmarking myInts/User3237465 -- folding variant see comments
time                 371.0 ns   (370.2 ns .. 371.7 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 372.5 ns   (370.8 ns .. 375.0 ns)
std dev              6.824 ns   (4.076 ns .. 11.19 ns)
variance introduced by outliers: 22% (moderately inflated)

benchmarking myDouble/Sibi
time                 678.9 ns   (642.3 ns .. 743.8 ns)
                     0.978 R²   (0.958 R² .. 1.000 R²)
mean                 649.9 ns   (641.1 ns .. 681.6 ns)
std dev              50.99 ns   (12.60 ns .. 105.0 ns)
variance introduced by outliers: 84% (severely inflated)

benchmarking myDouble/Chi
time                 643.3 ns   (617.4 ns .. 673.6 ns)
                     0.987 R²   (0.979 R² .. 0.996 R²)
mean                 640.6 ns   (626.7 ns .. 665.6 ns)
std dev              58.35 ns   (40.63 ns .. 87.82 ns)
variance introduced by outliers: 88% (severely inflated)

benchmarking myDouble/User3237465
time                 630.4 ns   (622.9 ns .. 638.5 ns)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 637.8 ns   (625.4 ns .. 659.8 ns)
std dev              53.15 ns   (33.46 ns .. 78.36 ns)
variance introduced by outliers: 85% (severely inflated)

benchmarking myDouble/User3237465 -- folding variant see comments
time                 398.1 ns   (380.7 ns .. 422.0 ns)
                     0.988 R²   (0.980 R² .. 0.996 R²)
mean                 400.6 ns   (389.1 ns .. 428.6 ns)
std dev              55.83 ns   (28.94 ns .. 103.6 ns)
variance introduced by outliers: 94% (severely inflated)

Benchmark addNums: FINISH
Completed all 2 actions.

As noted in the comments - another variant using import Data.Foldable (foldl') and addNums' = foldl' (+) 0 is significantly faster (thanks @User3237465!!)

Since you've done everything, could you also provide the output? In my version there should be `foldl' (+) 0` instead of `sum` — this should improve performance. — effectfully, Dec 22 '15 at 23:06
Don't forget my [code too](http://lpaste.net/147760) (me and user3237465 had an optimization battle in the comment section of my answer). — PyRulez, Dec 24 '15 at 19:18

PyRulez · Answer 2 · 2015-12-24T18:41:15.547

1

Actually, to increase efficiency, change your type. For folding purposes, you can't beat a Church Like Encoding. Might I recommend:

newtype Tree a = Tree {fold :: forall r. r -> (a -> [r] -> r) -> r}

Or even:

newtype Tree a = Tree {fold :: forall r. r -> (a -> ChurchList r -> r) -> r}

Or best yet:

newtype Tree a = Tree {fold :: forall tree list. tree -> (a -> list -> tree) -> list -> (tree -> list -> list) -> tree}

Church Encoding is more efficient because you do not have to traverse anything.

edited Dec 24 '15 at 18:41

answered Dec 22 '15 at 02:21

PyRulez

10,513
10
42
87

1

[Right folding is not efficient for summing integers](https://wiki.haskell.org/Foldr_Foldl_Foldl'). – effectfully Dec 22 '15 at 06:11
@user3237465 Those are regular integers. I'm talking about Church Encoded structures. – PyRulez Dec 22 '15 at 09:49
Church-encoded structures represent right folds. I guess with the current [call arity](http://www.joachim-breitner.de/publications/CallArity-TFP.pdf) analysis `foldl'` in terms of `foldr` perform optimally for Church-encoded lists as well as for usual, but call arity doesn't always work well for trees as described in the paper. So, what's you definition of `sum`? And with all fancy optimizations GHC performs, dealing with Church-encoded data looks like a premature optimization, useless in most cases. – effectfully Dec 22 '15 at 10:39
@user3237465 For the first `Tree`, `sumtree (Tree func) = func 0 (\a ls -> a + sum ls)`. Notice that neither `Tree` nor `sumtree` use recursion. Although it resembles a right fold, there is no `foldr` to speak of. `Tree` is just a function and `sum` is just function application. Since there is no recursion, GHC can optimize it a lot. See http://okmij.org/ftp/tagless-final/course/Boehm-Berarducci.html and https://www.reddit.com/r/haskell/comments/3urjm3/a_series_about_optimization/cxhzz7n . – PyRulez Dec 23 '15 at 00:59
@user3237465 Indeed, if you look at GHC's list code, most of its optimizations are just ways of *eliminating recursion*. See [build](https://hackage.haskell.org/package/base-4.8.1.0/docs/GHC-Exts.html#v:build) and augment (which are closely related to Church Lists). GHC converts everything to them because they are easier to optimize. – PyRulez Dec 23 '15 at 01:03
Compile [this](http://lpaste.net/147700) with `-O2` and observe a huge memory leak. Then try [this](http://lpaste.net/147701). The former is 3 times slower than the latter. Yes, Church-encoded data is just a function, but that doesn't mean it's efficient. – effectfully Dec 23 '15 at 07:53
@user3237465 My example uses no recursion. – PyRulez Dec 23 '15 at 11:58
`sumtree` is not recursive, but to construct a `Tree` you'll need either to write everything by hand, which is impossible for big trees, or use recursion (but the presense of recursion is irrelevant anyway). Again, Church-encoded datas are non-tail-recursive functions, which are inefficient for certain tasks like summing integers. [Here](http://lpaste.net/147708) is a code which runs in 9s on my slow machine, can you beat this using Church-encoded trees? – effectfully Dec 23 '15 at 12:38
@user3237465 [This](http://lpaste.net/147760) runs an order of magnitude faster, *with yours using `-O2` and mine using `runghc`*. (With both using `-O2`, mine runs three orders of magnitude faster.) Also, I constructed the tree neither by hand nor by recursion (and no foldr either). (GHC optimizations work best with lists (since it can optimize lists into Church List like structures), but not user defined types.) – PyRulez Dec 24 '15 at 03:28
@user3237465 (I could probably double its speed, since I have certain duplicated code that could be put into a `let` block.) – PyRulez Dec 24 '15 at 03:37
OK, I posed the challenge wrongly: one additional rule is "no sharing". My code is about summing integers from an arbitrary tree, yours is about bottom-up dynamic programming. And that's why you're able to generate a tree without recursion: because you can build the tree bottom-up, which is rarely a case (but even in this case `twentyfive` is written "by hand"). Can you generate a big tree in which all numbers are distinct and compute their sum? (I wrote a [version](http://lpaste.net/147776), but it leaks for some reason). Please, say if this conversation is annoying. – effectfully Dec 24 '15 at 12:26
@user3237465 Oh no, this conversation is fun. When I translate your code to church version, yours is slightly faster ([code](http://lpaste.net/147776)). Yet, if I optimize it, mine is about twice as fast ([code](http://lpaste.net/147787)). (Notice that the optimization I used for the second piece of code is only possible to church like structures.) Church Structures really kick in though when it comes to fusion and monads, which is why libraries like attoparsec use them. (Any other ideas for fold challenges?) – PyRulez Dec 24 '15 at 16:30
(Of course, there are operations for which Church structures are bad. See http://stackoverflow.com/questions/32288370/more-efficient-tail-of-church-encoded-list. I was only thinking about folding though.) – PyRulez Dec 24 '15 at 16:31
[Here](http://lpaste.net/147791) is an improved version of my code. It acts basically as yours. It looks like your code uses that `foldl`-in-terms-of-`foldr` technique I was talking about. And then you turn `foldl` into `foldl'` using `BangPatterns`. No more challenges, that's the only thing I wanted to say: you have to make your right-fold-like thing left-fold-like to sum integers. – effectfully Dec 24 '15 at 17:10
@user3237465 I changed it back to right fold (whatever that means) and tweaked it a little, and now it is faster again (http://lpaste.net/147796). – PyRulez Dec 24 '15 at 17:31
That's because of `Tree Int` which results in integer overflow. Change it to `Tree Integer` and you'll get the same performance as before. So at each step you clone the whole tree, shift all elements and make a new root. That's a "right fold", perhaps I have something to learn. – effectfully Dec 24 '15 at 18:11
Also, both our versions allocate a great amount of memory. Probably it's the case that they both are just inefficient. – effectfully Dec 24 '15 at 18:16
1

@user3237465 ([Here](http://lpaste.net/147799) is a way to improve it, but I'm not sure if it counts. XD) – PyRulez Dec 24 '15 at 18:38

Efficient Folding Of A Sum Tree

2 Answers2

Update - How to use criterion in this case

Update2

Update - How to use `criterion` in this case