
I am doing web crawling on a Sun server with 32 virtual processors and 32GB memory.

I started 1460 threads to do the job. The runtime parameters I set were `-Xms2048` and `-Xmx2048`. I ran the code twice, and it crashed at a different point each time.

> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0xff390f38, pid=3053, tid=7640
> #
> # JRE version: 6.0_15-b03
> # Java VM: Java HotSpot(TM) Server VM (14.1-b02 mixed mode solaris-sparc )
> # Problematic frame:
> # C  [libc_psr.so.1+0xf38]  memset+0x78
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
> #
> 

EDIT: I reduced the number of threads to 40 and ran it on the same server. It crashed again at the point where RSS exceeded Swap (both were around 2150M); in other words, it crashed when memory went beyond the limit. Then I ran it on my own PC with 4G RAM and a dual-core processor. To my surprise, it has been running well so far. Memory usage on the PC stays around 1.5G, comfortably below the limit. It has been running so steadily that there seems to be some mechanism on the PC that keeps memory from reaching its limit. In contrast, it seemed to go out of control on the Sun server.

EDIT: It hasn't crashed so far since I upgraded to the latest 64-bit Java.

Terry Li
  • Have you tried reducing the number of threads (10, 20, 30, ...) to see if it's a scaling problem? If every one of your threads connects to the web, that means at least 1460 outgoing TCP connections. And if they keep the main connection open while retrieving other resources (HTML -> JS, CSS, IMG), the connections multiply a lot! You only have 2^16 port numbers (0 to 65535), and in fact a lot of them are already in use. – helios Aug 20 '12 at 12:04
  • You're using `JRE version: 6.0_15-b03`. Try upgrading to the latest JRE first. The latest update is `34`. [Things happened between `15` and `34`...](http://www.oracle.com/technetwork/java/javase/releasenotes-136954.html) – Lukas Eder Aug 20 '12 at 12:05
  • It's a scaling problem. The `memset` fails because the system cannot allocate space to back a mapping. (Unless this is a 32-bit application, in which case there's an obscure bug that can cause this when the low-order 32 bits of the size are all zero.) – David Schwartz Aug 20 '12 at 12:05
  • Have you considered using NIO instead of more than a thousand threads? The savings in stack memory alone should make it worthwhile. See also: http://stackoverflow.com/questions/592303/asynchronous-io-in-java – beny23 Aug 20 '12 at 12:30
  • Regarding your edit, it seems the Sun JVM implementation has something that is not freeing resources. But I guess it only happens in certain situations; otherwise it would be a well-known bug. I'm totally in favor of a non-blocking API, using just as many threads as your CPU has cores, and using callback interfaces to handle responses. Then each new event (a new URL to fetch, a new response to parse) is a new task you assign to the work queue... – helios Aug 23 '12 at 06:46
  • @LukasEder I think you were probably right. Thanks. – Terry Li Aug 31 '12 at 08:06
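
The worker-pool idea from the comments (about as many threads as cores, with each URL queued as a task) could look roughly like the sketch below. It covers only the bounded pool, not non-blocking NIO. The class `BoundedCrawler`, the method `crawlAll`, and the stub `fetch` are hypothetical names made up for illustration; `fetch` does no network I/O and only stands in for a real download.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedCrawler {

    // Hypothetical placeholder: a real crawler would download the page here.
    public static String fetch(String url) {
        return "fetched:" + url;
    }

    // Runs every URL through a fixed-size pool; extra tasks wait in the
    // pool's internal queue instead of each getting its own thread.
    public static List<String> crawlAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<Future<String>>();
            for (final String url : urls) {
                Callable<String> task = () -> fetch(url);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<String> results = new ArrayList<String>();
            for (Future<String> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while crawling", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fetch task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // One worker per available processor, as suggested in the comments.
        int threads = Runtime.getRuntime().availableProcessors();
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        for (String result : crawlAll(urls, threads)) {
            System.out.println(result);
        }
    }
}
```

With a pool like this, the number of thread stacks stays fixed regardless of how many URLs are queued, which is the memory saving the comments point to.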

1 Answer


Have you tried appending an 'M' to the memory parameters (e.g. `-Xms2048M`)? Also, I would try setting `-Xms` to a smaller value (e.g. `1024M`) in case the VM can't reserve enough space for the heap.
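
One way to check what the flags actually did is to ask the running JVM how much heap it has. Without a unit suffix, HotSpot reads the value of `-Xms`/`-Xmx` as bytes, so `2048` means 2048 bytes, not 2048 megabytes. The sketch below prints the heap limits in MiB; the class name `HeapCheck` and the helper `toMiB` are hypothetical names used only for this example.

```java
public class HeapCheck {

    // Converts a byte count to whole mebibytes (rounding down).
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx limit the JVM is actually using.
        System.out.println("max heap (MiB):   " + toMiB(rt.maxMemory()));
        // totalMemory() is the heap currently reserved by the JVM.
        System.out.println("total heap (MiB): " + toMiB(rt.totalMemory()));
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("free heap (MiB):  " + toMiB(rt.freeMemory()));
    }
}
```

Running it as, say, `java -Xms1024M -Xmx2048M HeapCheck` should report a maximum close to 2048 MiB; the exact figure can differ slightly between JVM versions and collectors.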

William