
Here's my implementation of a sort of treap (with implicit keys and some additional information stored in nodes): http://hpaste.org/42839/treap_with_implicit_keys

According to the profiling data, GC takes 80% of the running time for this program. As far as I understand, this is caused by the fact that every time a node is 'modified', each node on the path to the root is recreated.

Is there something I can do here to improve performance, or do I have to descend into the realm of the ST monad?
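For readers without the paste open, here is a minimal, illustrative sketch of the pattern in question (names and fields are mine, not the paste's): updating one element of an implicit-key tree allocates a fresh node on every level of the root-to-element path, while the subtrees off the path are shared rather than copied.

```haskell
-- Illustrative only: a stripped-down implicit-key tree showing why
-- every update allocates O(log n) fresh nodes along the spine.
data Tree = Nil
          | Node !Int  -- subtree size
                 !Int  -- heap priority
                 !Int  -- stored value
                 Tree  -- left child
                 Tree  -- right child

treeSize :: Tree -> Int
treeSize Nil              = 0
treeSize (Node s _ _ _ _) = s

-- Smart constructor that maintains the size field.
node :: Int -> Int -> Tree -> Tree -> Tree
node p v l r = Node (treeSize l + treeSize r + 1) p v l r

-- Replace the value at index i: a new Node is built at every level
-- of the path, but the subtrees off the path are shared, not copied.
updateAt :: Int -> Int -> Tree -> Tree
updateAt _ _ Nil = Nil
updateAt i x (Node s p v l r)
  | i < ls    = Node s p v (updateAt i x l) r
  | i == ls   = Node s p x l r
  | otherwise = Node s p v l (updateAt (i - ls - 1) x r)
  where ls = treeSize l

valueAt :: Int -> Tree -> Int
valueAt _ Nil = error "index out of range"
valueAt i (Node _ _ v l r)
  | i < ls    = valueAt i l
  | i == ls   = v
  | otherwise = valueAt (i - ls - 1) r
  where ls = treeSize l
```

Note that the old version of the tree stays fully usable after `updateAt`; persistence is exactly what makes the spine copying unavoidable in a pure structure.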

Don Stewart
adamax
  • @adamax: this behavior (recreating everything up to the root) is common in immutable structures. Have you read Purely Functional Data Structures by Chris Okasaki (http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf)? He wrote several papers on this. – Matthieu M. Jan 07 '11 at 16:31
  • 1
    Perhaps you should verify this by running your program with `+RTS -s -RTS` as I can't see this 80% you speak of when I gave it a quick run using 7.0.1, I see about 16% time spent in GC. – ScottWest Jan 07 '11 at 17:21
  • @ScottWest: I compile it with ghc -O2 -prof --make test.hs and run with ./test +RTS -s -RTS, it says %GC time 77.4% (77.4% elapsed), the total time is 8.7 seconds. But my ghc version is 6.12.1. Just out of interest, what is the total time at your system? – adamax Jan 07 '11 at 17:53
  • @adamax: Try it without the -prof flag, the usages are much lower, I suspect. With the -prof -auto-all flags mine stack overflows. – ScottWest Jan 07 '11 at 17:56
  • @Matthieu: I understand that this behaviour is inherent to functional data structures, but maybe I'm missing something and the true reason for the poor performance is stack growth or something else? – adamax Jan 07 '11 at 18:01
  • @ScottWest: Oh, I thought -prof was required for runtime statistics. I tried it with -O2 but without -prof; it's 81.8%. Without -O2 it stack overflows here as well. – adamax Jan 07 '11 at 18:07
  • @adamax, I suppose the last variable is the GHC version. The time for the successful (non-prof) run is about 6.5 seconds. – ScottWest Jan 07 '11 at 18:40
  • @adamax, I have about the same results as you. With ghc-7.0.1 and -O2, runtime is 8.027s and productivity is 16.8%. If I compile with -O2 -funfolding-use-threshold=256, runtime is 3.863s and productivity is 24.9%. I tried a few other options and INLINE'ing stuff, but this is the best I've got so far. – John L Jan 08 '11 at 00:11
  • I don't think you should be using strict Trees for the branches `left` and `right`, this may cause more copying than necessary. – luqui Jan 08 '11 at 01:00
  • @adamax, forgot to mention I had similar results with ghc-6.12.1. – John L Jan 08 '11 at 01:13
  • @luqui, non-strict Trees for the branches are slightly worse by my tests. The strictifications in the file are optimal AFAICT (although the `UNPACK` pragma is ignored on the branches). – John L Jan 08 '11 at 01:16
  • a small improvement: remove the `!`'s from the where-bound vars `first`, `second` in `merge`, and from `t` in `split`. I expected these would be irrelevant, but removing them saves about 0.5s. – John L Jan 08 '11 at 01:33
  • @adamax, is it possible to redefine `split` using a bottom-up traversal? I think this would be much more efficient. – John L Jan 08 '11 at 01:36
  • @John, by bottom-up traversal you mean avoiding a recursive call of split? I doubt it's possible. It's kind of a standard algorithm, I would be surprised if there were a better way. – adamax Jan 08 '11 at 12:51
  • Note that `UNPACK` has no effect on sum types i.e. using `UNPACK` on the `left` and `right` fields has no effect. – tibbe Mar 07 '11 at 08:07
  • Don't forget GHC's options for garbage collection: http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection – Don Stewart Apr 17 '11 at 00:32
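For context on the `split` discussion above, the standard top-down implicit-key split has the following shape (a hedged sketch; the paste's version carries extra per-node information). It walks a single root-to-leaf path and rebuilds only that path, which is why a fundamentally cheaper bottom-up variant is hard to find.

```haskell
-- Sketch of the standard top-down implicit-key split discussed in
-- the comments (illustrative; the real code is in the linked paste).
data Tree = Nil | Node !Int !Int !Int Tree Tree  -- size, priority, value

treeSize :: Tree -> Int
treeSize Nil              = 0
treeSize (Node s _ _ _ _) = s

-- Smart constructor that maintains the size field.
node :: Int -> Int -> Tree -> Tree -> Tree
node p v l r = Node (treeSize l + treeSize r + 1) p v l r

-- split t n = (first n elements, the rest).
-- Only one fresh node is allocated per level of the descent.
split :: Tree -> Int -> (Tree, Tree)
split Nil _ = (Nil, Nil)
split (Node _ p v l r) n
  | n <= treeSize l = let (a, b) = split l n
                      in (a, node p v b r)
  | otherwise       = let (a, b) = split r (n - treeSize l - 1)
                      in (node p v l a, b)
```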

1 Answer


Using GHC 7.0.3, I can reproduce your heavy GC behavior:

  $ time ./A +RTS -s
  %GC time      92.9%  (92.9% elapsed)
  ./A +RTS -s  7.24s user 0.04s system 99% cpu 7.301 total

I spent 10 minutes going through the program. Here's what I did, in order:

  • Set GHC's -H flag, increasing limits in the GC
  • Check unpacking
  • Improve inlining
  • Adjust the first generation allocation area

The result is a 10-fold speedup, with GC down to around 45% of the runtime.


In order, using GHC's magic -H flag, we can reduce that runtime quite a bit:

  $ time ./A +RTS -s -H
  %GC time      74.3%  (75.3% elapsed)
  ./A +RTS -s -H  2.34s user 0.04s system 99% cpu 2.392 total

Not bad!

The UNPACK pragmas on the Tree-typed fields won't do anything (GHC ignores UNPACK on sum types), so remove those.
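For reference, UNPACK only takes effect on strict fields whose type has a single constructor, such as `Int`; on a sum type like the tree itself the pragma is silently ignored (later GHC versions warn about unusable UNPACK pragmas). A sketch of where it does and doesn't pay off (field layout is illustrative, not the paste's):

```haskell
-- Illustrative: UNPACK only helps on strict fields of
-- single-constructor types such as Int.
data Tree = Nil
          | Node {-# UNPACK #-} !Int  -- unpacked into the constructor
                 {-# UNPACK #-} !Int  -- unpacked as well
                 !Tree                -- an UNPACK here would be ignored:
                 !Tree                -- Tree is a sum type (Nil | Node)
```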

Inlining `update` shaves off more runtime:

 ./A +RTS -s -H  1.84s user 0.04s system 99% cpu 1.883 total

as does inlining `height`:

 ./A +RTS -s -H  1.74s user 0.03s system 99% cpu 1.777 total
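The shape of that change (a sketch; the real `update` and `height` are in the paste) is simply attaching INLINE pragmas to the small, hot helpers:

```haskell
data Tree = Nil | Node !Int Tree Tree  -- height and children (illustrative)

-- Small, frequently called accessors are the usual INLINE candidates:
-- the pragma asks GHC to substitute the body at every call site,
-- removing the call and exposing further simplification.
height :: Tree -> Int
height Nil          = 0
height (Node h _ _) = h
{-# INLINE height #-}
```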

So while it is fast, GC is still dominating -- since we're testing allocation, after all. One thing we can do is increase the first gen size:

 $ time ./A +RTS -s -A200M
 %GC time      45.1%  (40.5% elapsed)
 ./A +RTS -s -A200M  0.71s user 0.16s system 99% cpu 0.872 total

And increasing the unfolding threshold, as John L suggested, helps a little:

 ./A +RTS -s -A100M  0.74s user 0.09s system 99% cpu 0.826 total

which is what, 10x faster than we started? Not bad.
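If you'd rather bake the flag into the source than into the build command, the same unfolding threshold can be set with an OPTIONS_GHC pragma (a sketch; the value 256 is the one from John L's comment, not a recommendation):

```haskell
{-# OPTIONS_GHC -funfolding-use-threshold=256 #-}
module Main where

-- The pragma above is equivalent to passing
-- -funfolding-use-threshold=256 on the ghc command line: it raises
-- the size limit below which GHC considers a function's unfolding
-- for inlining at call sites.
main :: IO ()
main = putStrLn "compiled with a larger unfolding threshold"
```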


Using ghc-gc-tune, you can see runtime as a function of -A and -H,

[Plot from ghc-gc-tune: running time and GC as a function of the -A and -H settings]

Interestingly, the best running times use very large -A values, e.g.

$ time ./A +RTS -A500M   
./A +RTS -A500M  0.49s user 0.28s system 99% cpu 0.776s
Don Stewart
  • Wow, that's incredible! I see similar behaviour with ghc 6.12.1; the biggest speedup is given by inlining `update` and setting the first gen size. One question: did you forget to include the size for -H? Just -H doesn't seem to do anything, while -H64m, for instance, does. – adamax Apr 17 '11 at 09:10
  • With GHC 7, -H increases all the limits for the GC. – Don Stewart Apr 17 '11 at 16:28
  • More precisely, -H is a kind of "automatic -A": it increases the -A setting, but without increasing the overall memory use. This is possible because we're doing copying GC, so between major GCs there's a lot of memory going unused. Increasing -A is not always a good idea - in some programs it will make things worse, due to increased cache misses. – Simon Marlow Apr 18 '11 at 21:34