I recently posted a question about improving memory usage/GC and have since been able to reduce the memory consumption to what I believe is an appropriate/proportional level by using `ShortByteString`s (and in the process gone from "never completing" to just "very, very slow" for the current tests), but there still seems to be what I would consider excessive GC time.
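For context, the change was essentially interning keys with `toShort` as early as possible. This is a minimal sketch, not my actual code; `compactKeys` is a made-up name:

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Short as BS

    -- toShort copies the bytes into a compact, unpinned heap object,
    -- dropping the ForeignPtr/offset/length bookkeeping that each
    -- ByteString slice carries.
    compactKeys :: [B.ByteString] -> [BS.ShortByteString]
    compactKeys = map BS.toShort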
Profiling a test results in the following output:
        49,463,229,848 bytes allocated in the heap
        68,551,129,400 bytes copied during GC
           212,535,336 bytes maximum residency (500 sample(s))
             3,739,704 bytes maximum slop
                   602 MB total memory in use (0 MB lost due to fragmentation)

                                           Tot time (elapsed)  Avg pause  Max pause
        Gen  0     14503 colls,     0 par    1.529s   1.681s     0.0001s    0.0164s
        Gen  1       500 colls,     0 par   79.202s  79.839s     0.1597s    0.3113s

        TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

        SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

        INIT    time    0.000s  (  0.001s elapsed)
        MUT     time   29.500s  ( 82.361s elapsed)
        GC      time   47.253s  ( 47.983s elapsed)
        RP      time    0.000s  (  0.000s elapsed)
        PROF    time   33.478s  ( 33.537s elapsed)
        EXIT    time    0.000s  (  0.025s elapsed)
        Total   time  110.324s  (130.372s elapsed)

        Alloc rate    1,676,731,643 bytes per MUT second

        Productivity  26.8% of total user, 22.7% of total elapsed

        gc_alloc_block_sync: 0
        whitehole_spin: 0
        gen[0].sync: 0
        gen[1].sync: 0
And the following heap visualization:
Currently I'm using `-H`, which seemed to help, and I have experimented with combinations of increasing the thread count, the number of generations, the growth factor, and enabling compaction (`-N`, `-G`, `-F`, `-c`), all of which resulted in either no apparent change or a decrease in performance. It's clear that everything is long-lived (when I increased `-G`, the gen 1 statistics essentially moved to the oldest generation, with nothing between it and gen 0), but I don't understand why the GC can't just "leave it alone". From what I've read I thought the GC only runs when it's out of allocation space, but increasing the allocation area/factor/heap seemed to have no effect.
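For reference, the experiments were of this general shape (the module name and sizes below are placeholders for illustration, not my real values):

    # Build with the threaded runtime and RTS options enabled
    # (Main.hs is a placeholder name):
    ghc -O2 -threaded -rtsopts Main.hs

    # One representative combination: larger allocation area, an extra
    # generation, a higher growth factor, and compaction of the oldest
    # generation:
    ./Main +RTS -s -N1 -A64m -G3 -F4 -c -H1G -RTS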
Is there anything else I can try in order to decrease the GC effort? Or is there some way to tell that there is a fundamental problem with my code that makes it impossible to decrease this time? I am, and believe I must be, building up a large data structure in memory: currently a hashtable of mutable vectors inside `ST` (roughly the shape sketched below). My only other thought is that internally my data structure is being [needlessly?] copied, forcing the GC to run, but I'm using `ST` with the expectation of avoiding that behavior.
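To be concrete, it is roughly this shape, simplified (the `Basic` hashtable flavour, the `Int` payload, and the capacities are stand-ins for my actual ones):

    import Control.Monad.ST (ST)
    import Data.ByteString.Short (ShortByteString)
    import qualified Data.HashTable.ST.Basic as H       -- hashtables package
    import qualified Data.Vector.Unboxed.Mutable as MV  -- vector package

    -- A mutable hash table from keys to mutable unboxed vectors,
    -- all living in ST. Int is a stand-in for the real element type.
    type Table s = H.HashTable s ShortByteString (MV.MVector s Int)

    -- Record a value, allocating a fresh vector the first time a key
    -- is seen. (The real code tracks fill counts and grows vectors;
    -- writing at position 0 here is just a placeholder.)
    record :: Table s -> ShortByteString -> Int -> ST s ()
    record ht key x = do
      found <- H.lookup ht key
      case found of
        Just v  -> MV.write v 0 x
        Nothing -> do
          v <- MV.new 16              -- placeholder initial capacity
          MV.write v 0 x
          H.insert ht key v

My understanding is that the unboxed payloads live in plain byte arrays that the collector never has to scan for pointers, which is part of why I expected GC to stay cheap here.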