
I have an application which uses rather a lot of actors: 25,000 to be precise. It uses Scala 2.7.7 and is running on jdk6_u18. It basically listens to and processes market data and has very little state.

It starts at 8.02am every day and within the hour it has crashed with an OutOfMemoryError. "Aha" you say, "you have a memory leak!" Except that when I restart it, it never, ever crashes again for the rest of the day, despite both GC and CPU overhead ramping up when the US markets open at 2.30pm.

A few anecdotal findings:

  • it runs on Solaris. When I used to run it on Linux, it never crashed at all.
  • I have tried messing with generation heap sizing, allocating more memory, etc.; as far as I can tell it makes no difference
  • The behaviour of the collector seems to differ when I have verbose:gc switched on

A few questions present themselves:

  1. Why would the behaviour of this program be different between Linux and Solaris?
  2. Why would the behaviour be different between starting it at 8.02 and starting it at 8.42?
  3. I have heard that the actors library had some memory-leak issues. What were they, when were they fixed and how would I discover if anything "similar" might be going on here? (Things to look for in jhat etc)
  4. Does anyone have an idea as to what might be going on?

I am now trying G1 to see if that makes any difference. I will update this question tomorrow!
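
For reference, enabling G1 on this JDK requires unlocking the experimental options; the invocation amounts to roughly the following (the main class name is a placeholder, and the 256Mb heap matches the figures in the logs below):

java -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC \
     -Xms256m -Xmx256m -verbose:gc \
     com.example.MarketDataListener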

Some output from G1 with verbose:gc on

I think I just caught it in the act:

600.290: [Full GC 255M->144M(256M), 1.5772616 secs]
602.084: [GC pause (young) 227M->145M(256M), 0.0556769 secs]
602.418: [Full GC 254M->144M(256M), 1.6415216 secs]
604.279: [GC pause (young) 227M->145M(256M), 0.0415157 secs]
604.602: [Full GC 255M->145M(256M), 1.6041762 secs]
606.422: [GC pause (young) 227M->145M(256M), 0.0237441 secs]
606.710: [Full GC 254M->145M(256M), 1.6022185 secs]

And then a bit later (you can see that the full GCs are taking longer and reclaiming less)

849.084: [Full GC 254M->176M(256M), 1.9658754 secs]
851.191: [GC pause (young) 228M->176M(256M), 0.0218611 secs]
851.414: [Full GC 254M->176M(256M), 1.9352357 secs]
853.492: [GC pause (young) 228M->176M(256M), 0.0224688 secs]
853.716: [Full GC 254M->176M(256M), 1.9339705 secs]
855.793: [GC pause (young) 228M->176M(256M), 0.0215707 secs]
856.009: [Full GC 254M->176M(256M), 1.9805797 secs]
858.137: [GC pause (young) 228M->176M(256M), 0.0223224 secs]

Some output from G1 with verbose:gc off

It's fine again! *Sigh*

303.656: [GC pause (young) 225M->93M(256M), 0.1680767 secs]
308.060: [GC pause (young) 226M->94M(256M), 0.1793724 secs]
312.746: [GC pause (young) 227M->93M(256M), 0.1674851 secs]
316.162: [GC pause (young) 227M->95M(256M), 0.1826145 secs]
320.147: [GC pause (young) 226M->94M(256M), 0.1656664 secs]
325.978: [GC pause (young) 226M->93M(256M), 0.1475760 secs]
330.176: [GC pause (young) 226M->94M(256M), 0.1727795 secs]

and much, much later it is still OK!

25882.894: [GC pause (young) 224M->125M(256M), 0.2126515 secs]
25884.880: [GC pause (young) 224M->126M(256M), 0.2059802 secs]
25887.027: [GC pause (young) 224M->125M(256M), 0.1851359 secs]
25889.940: [GC pause (young) 223M->126M(256M), 0.2046496 secs]
25891.567: [GC pause (young) 224M->126M(256M), 0.1600574 secs]

and later still, a full GC

37180.191: [GC pause (young) 225M->154M(256M), 0.1716404 secs]
37182.163: [GC pause (young) (initial-mark) 225M->153M(256M)37182.326: [GC concurrent-mark-start], 0.1622246 secs]
37183.089: [GC concurrent-mark-end, 0.7635219 sec]
37183.090: [GC remark, 0.0032547 secs]
37183.093: [GC concurrent-count-start]
37183.297: [GC concurrent-count-end, 0.2043307]
37183.393: [GC cleanup 198M->198M(256M), 0.0068127 secs]
37183.400: [GC concurrent-cleanup-start]
37183.400: [GC concurrent-cleanup-end, 0.0000393]
37183.648: [GC pause (young) 222M->153M(256M), 0.1483041 secs]
37184.235: [GC pause (partial) 171M->91M(256M), 0.2520714 secs]
37187.223: [GC pause (young) 221M->92M(256M), 0.1721220 secs]

UPDATE

Well, since switching to the G1 garbage collector on jdk1.6.0_18 the application has behaved itself for three consecutive days. I suspect Erik is correct in his analysis: in high-throughput bursts the VM promotes short-lived objects into the tenured generation and gets itself into a death spiral.

oxbow_lakes
  • You should first try to gather more information about the objects which are in memory when the OutOfMemoryException occurs using -XX:+HeapDumpOnOutOfMemoryError – stacker Feb 02 '10 at 10:35
  • Yes - I've done this but I have difficulty running jhat on a heap dump of 256Mb! – oxbow_lakes Feb 02 '10 at 10:56
  • 1
    Did you try other analyzer tools like: http://www.alphaworks.ibm.com/tech/heapanalyzer , there are a few more recommended in other SO questions. – stacker Feb 02 '10 at 11:17
  • When you switched from Linux to Solaris did you inadvertently switch from 32bit to 64bit? That can make a surprisingly large difference. – Erik Engbrecht Feb 02 '10 at 12:51
  • Hi Erik - no, they are both 32-bit. The main difference is that the JVM on the Linux box is much older (1.6.0_03) than the versions I've tried on Solaris (1.6.0_13 and 1.6.0_18) – oxbow_lakes Feb 02 '10 at 18:20
  • @stacker - I might have a look at something like this tomorrow. I can't analyze my 256M heap using `jhat` even running it with 1.5G of memory! – oxbow_lakes Feb 02 '10 at 18:48

3 Answers


Do you have any reason to expect your heap to slowly grow in size? It looks like in both traces it is growing. One thing I've done on many occasions is reduce my heap to try to make the problem worse. However, 256M is about the lower limit for Scala.

One thing I've noticed before is that if short-lived objects escape the eden generation because it is under too much pressure, that can gradually kill you. This can happen when there's a burst of activity (maybe you have a burst in the morning?) and your eden space isn't big enough. Scala in general and actors in particular make heavy use of small, short-lived objects, and there seems to be a magic threshold beyond which it's all downhill. So one run will be fine and the next crashes and burns.
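
One way to check whether short-lived objects are being promoted is to watch the tenuring distribution; roughly these flags (standard HotSpot options, nothing exotic) will print how many bytes survive each young collection and at what age they get tenured:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintTenuringDistribution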

Another thing I've noticed before is that the GC settings that work well on OSX/x86 often do not work well on Sparc/Solaris and vice versa. If you're on CoolThreads, I suggest you configure the GC to have one GC thread per thread in your scheduler/pool.
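
The GC thread count can also be pinned explicitly rather than left to ergonomics; for example, to match an 8-thread pool (the 8 is purely illustrative):

-XX:ParallelGCThreads=8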

Which brings up another thing - make sure the scheduler isn't creating new threads willy-nilly. It will do that sometimes. I'd say you should almost always manually set a cap on the threads. I don't know how relevant it is, but an interesting factoid about the fork-join scheduler that actors use by default is that it is intended for short, CPU bound tasks. Doing IO in the threads it manages screws up its assumptions. Of course it should still work, but...
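
If I remember right, the 2.7.x scheduler reads its pool bounds from system properties before it starts, so a minimal sketch of capping it (assuming the actors.corePoolSize/actors.maxPoolSize property names and an illustrative cap of 8) looks like this:

// Cap the actor scheduler's thread pool before any actor is started.
// The property names are the ones the 2.7.x scheduler reads; the value is
// illustrative - pick a cap that suits your hardware.
object SchedulerConfig {
  def capActorPool(threads: Int) {
    System.setProperty("actors.corePoolSize", threads.toString)
    System.setProperty("actors.maxPoolSize", threads.toString)
  }
}

// e.g. SchedulerConfig.capActorPool(8) as the first line of main, or
// equivalently -Dactors.corePoolSize=8 -Dactors.maxPoolSize=8 on the command line.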

Good luck! I've lost many, many days of my life to such problems.

Take a look at some of the options here: http://java.sun.com/performance/reference/whitepapers/tuning.html

It looks like you are using the concurrent-mark-sweep collector. Try setting:

-XX:+UseParallelGC
Erik Engbrecht
  • Hi Erik - No - I've added a third trace. The second example was quite stable and still going strong after 6 hours. A full GC would bring the memory down below 100Mb, whereupon it would slowly creep back up to 150Mb before a full GC kicked in again – oxbow_lakes Feb 03 '10 at 07:21
  • I've set the max pool size of the actor thread pool to be 8. Your point about short bursts of creation in the morning is probably correct but I can't quite understand it. When markets open at 8am there's obviously a ramping up of messages. These stay roughly constant until the US opens at 2.30pm, when activity doubles. I was thinking something along the lines of "my app can't grow the heap fast enough", so I set Xms/Xmx both to 256Mb and scheduled the app to start at 8.02am, but this hasn't worked. If the move to G1 does not prove fruitful, I'll look at changing the Eden ratios in the old GC – oxbow_lakes Feb 03 '10 at 07:28
  • GC performance can vary widely depending on what optimizations HotSpot has performed on the code. If you are hitting a cold VM with a high load that then tapers off that may stress it too much. By the time US markets open the VM may be warmed up and better able to handle the load. – Erik Engbrecht Feb 03 '10 at 12:54
  • I have looked around for tips on good VM options to use in certain situations without much success. The general opinion seems to be *don't use any bespoke options; let the VM decide what to do itself* – oxbow_lakes Feb 03 '10 at 15:47
  • @Erik - I have other actor-based applications which run happily with <10Mb of heap. Can you expand on what you mean by "Scala needs 256Mb?" – oxbow_lakes Feb 03 '10 at 16:01
  • I just find Scala works better with more memory. 256M is not a scientific number, but rather that I find if I don't want to be bothered with tuning setting the heap at 512m or often 1g works better. It gives HotSpot more breathing room. – Erik Engbrecht Feb 04 '10 at 13:54

At this point, I wonder if it would be worth trying to replace your actors with Lift's or Akka's? Though I think it's unlikely to be a bug in the actors library, they might not tickle whatever it is that causes the crash.

Daniel C. Sobral

From my point of view the heap layout is wrong. The old space varies from 93M at the start to 176M under high load, and from the logs the allocation rate looks to be about 50MB/sec on average. So with full GC pauses of about 2 seconds once the old space reaches 176M, there is hardly any young space left (only ~80M of the 256M heap) in which to create new objects. My advice:

  • Check your survivor space settings: your old space grows to 176M, which shrinks the new space, and that's presumably not intentional
  • Set NewSize explicitly to the value you want, say 128M (see the flags sketch after this list)
  • Increase the overall heap size so that objects can be promoted instead of forcing full GCs
  • You have very long pauses: collecting ~80MB takes >50ms and ~120MB takes >150ms. Try CMS instead; I believe it will be faster (but do benchmark it)
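
A sketch of the sort of command line this advice amounts to (every number here is only a starting point and should be benchmarked against your workload):

-Xms512m -Xmx512m -XX:NewSize=128m -XX:MaxNewSize=128m \
-XX:SurvivorRatio=6 -XX:+UseConcMarkSweepGC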