4

I recently posted a question about improving memory usage/GC, and by switching to ShortByteStrings I have since been able to reduce memory consumption to what I believe is an appropriate, proportional level (in the process going from "never completing" to just "very, very slow" on current tests). However, the GC time still seems excessive to me.

Profiling a test results in the following output:

49,463,229,848 bytes allocated in the heap
68,551,129,400 bytes copied during GC
   212,535,336 bytes maximum residency (500 sample(s))
     3,739,704 bytes maximum slop
           602 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
Gen  0     14503 colls,     0 par    1.529s   1.681s     0.0001s    0.0164s
Gen  1       500 colls,     0 par   79.202s  79.839s     0.1597s    0.3113s

TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.000s  (  0.001s elapsed)
MUT     time   29.500s  ( 82.361s elapsed)
GC      time   47.253s  ( 47.983s elapsed)
RP      time    0.000s  (  0.000s elapsed)
PROF    time   33.478s  ( 33.537s elapsed)
EXIT    time    0.000s  (  0.025s elapsed)
Total   time  110.324s  (130.372s elapsed)

Alloc rate    1,676,731,643 bytes per MUT second

Productivity  26.8% of total user, 22.7% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

And the following heap visualization:

Heap Visualization

Currently I'm using -H, which seemed to help, and I have experimented with combinations of increasing threads, generations, and the growth factor, and with using compaction (-N, -G, -F, -c), all of which resulted in either no apparent change or a decrease in performance. It's clear that everything is long-lived (when I increased -G, the gen 1 statistics essentially moved to the oldest generation, with nothing between it and gen 0), but I don't understand why the GC can't just "leave it alone". From what I've read, I thought the GC only runs when it's out of allocation space, but increasing the allocation area/factor/heap seemed to have no effect.

Is there anything else I can try in order to decrease the GC effort? Or is there some way to tell that there is a fundamental problem with my code that makes it impossible to decrease this time? I am building up a large data structure in memory (and believe I must): currently a hash table of mutable vectors inside ST. My only other thought is that my data structure is internally being [needlessly?] copied, forcing the GC to run, but I'm using ST with the expectation of avoiding exactly that behavior.
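For reference, the general shape described (a `hashtables`-package table whose values are mutable vectors, all inside `ST`) might look something like the following purely illustrative sketch; none of these names or values come from the actual application:

```haskell
-- Illustrative only: a hash table of mutable vectors inside ST,
-- the general shape described above (not the actual application code).
import Control.Monad.ST (runST)
import qualified Data.HashTable.ST.Basic as HT
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as VM

demo :: Maybe (V.Vector Int)
demo = runST $ do
  tbl <- HT.new                      -- mutable hash table in ST
  vec <- VM.replicate 4 (0 :: Int)   -- mutable vector as the value
  VM.write vec 0 42
  HT.insert tbl (7 :: Int) vec
  mvec <- HT.lookup tbl 7
  traverse V.freeze mvec             -- freeze for a pure result
```

Note that the table and the vectors here are all boxed, pointer-rich heap objects, which is relevant to the answer below about GC tracing cost.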

ryachza
  • 1
    Can you include all the code (CSV.getFileRaw) and maybe the file you are testing with? – user2407038 Dec 31 '15 at 23:03
  • @user2407038 I updated the previous question with the full test code including a function to generate test data. The results shown here are actually for the full application which does a lot more work before/after the step that seems to be the problem. – ryachza Jan 01 '16 at 05:07

1 Answer

3

I haven't looked at your code yet, so you should take this as an extended comment rather than an answer. If, as you say, you must build up a very large structure, there are two things you can do to limit the performance impact:

Limit pointers

The first and easiest (but not easy!) step is to cut down on pointers as much as you can. A gigantic HashMap is likely to be much more expensive to build than a gigantic hash table backed by, and indexed by, an unboxed vector. Unboxed things never have to be traced by the GC, a huge advantage, but they'll still get copied by it. You may find this acceptable, because copying large contiguous vectors should be really cheap.
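As a hedged sketch of what that might look like (the names and the probing scheme here are illustrative, not from any particular library): a fixed-capacity table with `Int` keys and values kept entirely in unboxed mutable vectors, so the GC has no pointers to trace inside it:

```haskell
-- Illustrative sketch: a tiny open-addressing hash table backed by
-- unboxed mutable vectors. The GC sees two flat byte arrays, with no
-- interior pointers to trace.
import Control.Monad.ST (ST, runST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM

capacity :: Int
capacity = 16

-- Linear-probing insert; a key slot of (-1) marks an empty bucket.
insert :: UM.STVector s Int -> UM.STVector s Int -> Int -> Int -> ST s ()
insert keys vals k v = go (k `mod` capacity)
  where
    go i = do
      k' <- UM.read keys i
      if k' == (-1) || k' == k
        then UM.write keys i k >> UM.write vals i v
        else go ((i + 1) `mod` capacity)

demo :: (U.Vector Int, U.Vector Int)
demo = runST $ do
  keys <- UM.replicate capacity (-1)
  vals <- UM.replicate capacity 0
  insert keys vals 3 30
  insert keys vals 19 190   -- 19 `mod` 16 == 3, so this probes to slot 4
  (,) <$> U.freeze keys <*> U.freeze vals
```

A real table would also need resizing and deletion, but the point stands: the payload lives in unboxed storage the collector never walks.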

Pin everything

The next step is to hide things from the garbage collector by putting them in pinned memory. I believe standard (not small) ByteStrings do this, as well as arrays created through the foreign function interface.
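One concrete (and admittedly ugly) version of this idea, sketched here with made-up names: concatenate many small strings into a single large `ByteString` (which, at least at the time of writing, was allocated in pinned memory) and keep `(offset, length)` pairs to index into it:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Concatenate many small strings into one shared buffer, remembering
-- where each one starts and how long it is.
packAll :: [B.ByteString] -> (B.ByteString, [(Int, Int)])
packAll chunks = (B.concat chunks, zip offsets lens)
  where
    lens    = map B.length chunks
    offsets = scanl (+) 0 lens   -- running start positions

-- Recover an individual string from the shared buffer. take/drop on a
-- ByteString are O(1) slices that share the underlying bytes.
slice :: B.ByteString -> (Int, Int) -> B.ByteString
slice buf (off, len) = B.take len (B.drop off buf)
```

So instead of millions of small heap objects, the collector sees one big buffer plus one index structure.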

dfeuer
  • I recently converted from `HashMap` in `unordered-containers` to `HashTable` in `hashtables` hoping the migration to `ST` would gain something - it slightly decreased the run time, but didn't change the GC time, which overshadows everything. I've been looking into moving to unboxed values and would like to, but I'm not sure I can because of the data structures stored in the vectors. Regarding pinning, I recently moved from lazy bytestrings (pinned) to `ShortByteStrings`, which didn't change the proportion of GC time, but *hugely* reduced the total heap size and slightly reduced the total run time. – ryachza Dec 31 '15 at 21:07
  • Is there any way to pin arbitrary data? Like somehow make the GC ignore the map/table? At a minimum doing so might provide further insight. – ryachza Dec 31 '15 at 21:08
  • 1
    @ryachza, I don't *think* that's possible. I was talking about crazy things like replacing a bunch of `ByteString`s with a single huge one and a vector of indices into it. Yucky stuff. – dfeuer Dec 31 '15 at 21:26
  • Ah, interesting. I had actually considered doing something like that when I was comparing `Short` and `Plain` bytestrings - short took much less space, but plain was much faster. I was thinking I could try to get the best of both, but just went with short for now to get something done. Would you have any idea for proving that something like that would make a significant difference before doing it? It seems like it would be a lot of effort to "try out". – ryachza Dec 31 '15 at 21:40
  • @ryachza, try more detailed profiling, such as constructor profiling. Also, check again for other sources of trouble. I'll try to do so later, but I'm on a beach vacation with family. – dfeuer Dec 31 '15 at 22:22
  • 2
    In 2017, with GHC 8.2, it is now possible to pin arbitrary data using the interface provided by `Data.Compact`. – Andrew Thaddeus Martin Aug 07 '17 at 21:13
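A minimal sketch of the compact-region interface mentioned in the comment above (GHC 8.2 ships it in the `ghc-compact` boot library as `GHC.Compact`, re-exported by the `compact` package as `Data.Compact`); the value moved into the region must be fully evaluated and may not contain functions or mutable objects:

```haskell
-- Sketch of compact regions: data copied into a region is never
-- traced by the GC again; the region lives or dies as one unit.
import GHC.Compact (compact, getCompact)
import qualified Data.Map.Strict as M

main :: IO ()
main = do
  let m = M.fromList [(i, i * i) | i <- [1 :: Int .. 1000]]
  region <- compact m           -- deep-copies m into a compact region
  let m' = getCompact region    -- zero-cost view of the compacted value
  print (M.lookup 30 m')        -- prints Just 900
```

For a large, long-lived, effectively immutable structure like the one in the question, this directly addresses the "leave it alone" wish: the collector skips the region's interior entirely.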