
I am using the G1GC garbage collector for the Presto engine, on a slave node with 244 GB of RAM, with the following GC-related configs:

-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCCause
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+PrintClassHistogramAfterFullGC
-XX:+PrintClassHistogramBeforeFullGC
-XX:PrintFLSStatistics=2
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-Xloggc:/media/ephemeral0/presto/var/log/gc.log
-XX:+UseG1GC
-XX:+ExplicitGCInvokesConcurrent
-XX:+CMSClassUnloadingEnabled
-XX:+AggressiveOpts
-XX:PermSize=150M
-XX:MaxPermSize=150M
-XX:ReservedCodeCacheSize=150M
-Xmx210g

I know some of these properties do not apply to G1GC (the CMS-related ones). With this config, I am sometimes seeing GC evacuation-pause (young) times of 7-9 seconds once the heap grows to around 70 GB. I did some reading around it and tried going through the config properties; adding -XX:+DisableExplicitGC did not help. Any suggestions around tuning GC to reduce these pauses would be really appreciated. GC log fragment of one such instance:

    2017-08-09T18:01:43.529+0000: 6125.293: [GC pause (G1 Evacuation Pause) (young) 6125.295: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 69969, predicted base time: 813.09 ms, remaining time: 0.00 ms, target pause time: 200.00 ms]
     6125.295: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 297 regions, survivors: 39 regions, predicted young region time: 5339.39 ms]
     6125.328: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 297 regions, survivors: 39 regions, old: 0 regions, predicted pause time: 6152.47 ms, target pause time: 200.00 ms]
    , 6.7213894 secs]
       [Parallel Time: 6255.9 ms, GC Workers: 23]
          [GC Worker Start (ms): Min: 6125350.5, Avg: 6125355.7, Max: 6125361.1, Diff: 10.6]
          [Ext Root Scanning (ms): Min: 11.1, Avg: 304.8, Max: 501.2, Diff: 490.1, Sum: 7011.5]
          [Update RS (ms): Min: 0.0, Avg: 69.1, Max: 465.7, Diff: 465.7, Sum: 1589.2]
             [Processed Buffers: Min: 0, Avg: 33.4, Max: 376, Diff: 376, Sum: 768]
          [Scan RS (ms): Min: 173.8, Avg: 211.5, Max: 231.5, Diff: 57.7, Sum: 4865.2]
          [Code Root Scanning (ms): Min: 0.0, Avg: 3.1, Max: 51.1, Diff: 51.1, Sum: 70.6]
          [Object Copy (ms): Min: 5522.2, Avg: 5651.9, Max: 5992.8, Diff: 470.7, Sum: 129993.5]
          [Termination (ms): Min: 0.0, Avg: 5.8, Max: 6.9, Diff: 6.9, Sum: 133.9]
             [Termination Attempts: Min: 1, Avg: 31.7, Max: 44, Diff: 43, Sum: 730]
          [GC Worker Other (ms): Min: 0.9, Avg: 2.0, Max: 3.4, Diff: 2.5, Sum: 47.1]
          [GC Worker Total (ms): Min: 6243.1, Avg: 6248.3, Max: 6253.2, Diff: 10.1, Sum: 143710.9]
          [GC Worker End (ms): Min: 6131603.1, Avg: 6131604.0, Max: 6131604.9, Diff: 1.8]
       [Code Root Fixup: 1.5 ms]
       [Code Root Purge: 0.3 ms]
       [Clear CT: 302.2 ms]
       [Other: 161.6 ms]
          [Choose CSet: 32.7 ms]
          [Ref Proc: 25.8 ms]
          [Ref Enq: 0.8 ms]
          [Redirty Cards: 24.6 ms]
          [Humongous Register: 11.8 ms]
          [Humongous Reclaim: 3.8 ms]
          [Free CSet: 26.5 ms]
       [Eden: 9504.0M(9504.0M)->0.0B(9920.0M) Survivors: 1248.0M->832.0M Heap: 68.9G(210.0G)->60.2G(210.0G)]
     [Times: user=134.90 sys=6.83, real=6.72 secs]
    2017-08-09T18:01:50.261+0000: 6132.025: Total time for which application threads were stopped: …
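
For what it's worth, reading the numbers in that fragment: Object Copy dominates the pause (avg 5651.9 ms of the 6.72 s total), and the ergonomics lines show G1 had already predicted ~6152 ms for the chosen CSet against its 200 ms target. user=134.90 vs real=6.72 across 23 workers is roughly 20x parallelism, so the workers were genuinely busy copying rather than stalled on swap. The sizes are also consistent with 32 MB regions, G1's maximum: 9504 MB of eden over 297 regions is 9504 / 297 = 32 MB per region, i.e. 210 GB / 2048 target regions = 105 MB clamped down to the 32 MB ceiling. The region size can be double-checked with (same JVM binary as the node assumed):

    java -XX:+UseG1GC -Xmx210g -XX:+PrintFlagsFinal -version | grep G1HeapRegionSize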

I also looked into a similar question: Java garbage collector G1GC taking long for 'Object Copy' (Evacuation Pause). While it provides insights, I do not find any recommendations for an Xmx of around 200 GB.
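
For reference, a trimmed flag set keeping only what G1 would actually honor might look like the following sketch (assuming JDK 8: PermSize/MaxPermSize are ignored there in favor of Metaspace, CMSClassUnloadingEnabled and PrintFLSStatistics are CMS-only, AggressiveOpts is dropped, and MaxGCPauseMillis=200 merely makes G1's default target explicit), plus the Print*/logging flags from above:

    -XX:+UseG1GC
    -XX:+ExplicitGCInvokesConcurrent
    -XX:MaxGCPauseMillis=200
    -Xms210g
    -Xmx210g
    -XX:ReservedCodeCacheSize=150M
    -XX:MetaspaceSize=150M
    -XX:MaxMetaspaceSize=150M

Setting -Xms equal to -Xmx avoids the heap-growth resizing mentioned in the comments below.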

Ankit Dixit
  • Other young collections have lower pause times? Can you post a log entry for a fast one? – the8472 Aug 11 '17 at 17:48
  • Try setting the min size to the max (-Xms210g). Without that the VM has to grow the heap, which can be very expensive and end up with a suboptimal layout. – Dain Sundstrom Aug 11 '17 at 21:59
  • Did the -Xms210g help you? – dark_src Aug 16 '17 at 01:22
  • Yes, @dark_src, -Xms210g did help me in reducing the frequency. – Ankit Dixit Aug 16 '17 at 05:31
  • @DainSundstrom thanks, that helped in the cache warm-up phase. I see a lot of GC triggers from https://github.com/prestodb/presto/blob/fe28b4181b0599f47315b5d6aed9ce29025f7fd2/presto-main/src/main/java/com/facebook/presto/server/CodeCacheGcTrigger.java#L69 (a sketch of that trigger follows these comments). Do you have any suggestions around that? – Ankit Dixit Aug 16 '17 at 05:33
  • Well, the other question you linked to mentions swap and THP; have you checked those? – the8472 Dec 31 '17 at 12:27
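
For context on the CodeCacheGcTrigger comment above, a minimal sketch of the kind of polling trigger that class implements: a daemon thread watches the JDK 8 "Code Cache" memory pool and forces a GC once usage crosses a fill threshold, so that dead compiled methods get flushed. The 0.7 threshold and 1 s interval here are illustrative assumptions, not Presto's actual values:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;

    public class CodeCacheGcTriggerSketch
    {
        // Illustrative values; Presto's actual threshold/interval may differ.
        private static final double USAGE_THRESHOLD = 0.7;
        private static final long POLL_INTERVAL_MS = 1000;

        public static void main(String[] args)
        {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (!"Code Cache".equals(pool.getName())) {
                    continue;
                }
                Thread trigger = new Thread(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        long used = pool.getUsage().getUsed();
                        long max = pool.getUsage().getMax();
                        if (max > 0 && (double) used / max > USAGE_THRESHOLD) {
                            // With -XX:+ExplicitGCInvokesConcurrent this surfaces as a
                            // concurrent G1 cycle instead of a stop-the-world full GC.
                            System.gc();
                        }
                        try {
                            Thread.sleep(POLL_INTERVAL_MS);
                        }
                        catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
                trigger.setDaemon(true);
                trigger.start();
            }
        }
    }

Because -XX:+ExplicitGCInvokesConcurrent is set, each of those System.gc() calls shows up in the log as an extra concurrent cycle rather than a full GC.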

0 Answers