
I faced this exception on a Spark worker node:

    Exception in thread "dispatcher-event-loop-14" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.HashMap.newNode(HashMap.java:1747)
    at java.util.HashMap.putVal(HashMap.java:631)
    at java.util.HashMap.put(HashMap.java:612)
    at java.util.HashSet.add(HashSet.java:220)
    at java.io.ObjectStreamClass.getClassDataLayout0(ObjectStreamClass.java:1317)
    at java.io.ObjectStreamClass.getClassDataLayout(ObjectStreamClass.java:1295)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1480)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.rpc.netty.RequestMessage.serialize(NettyRpcEnv.scala:557)
    at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
    at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:520)
    at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$sendToMaster(Worker.scala:638)
    at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:524)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:216)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGTERM to handler- the VM may need to be forcibly terminated.

Just before this exception, the worker was repeatedly relaunching an executor because the executor kept exiting with:

EXITING with Code 1 and exitStatus 1

Configs:

  • -Xmx for worker process = 1GB
  • Total RAM on worker node = 100GB
  • Java 8
  • Spark 2.2.1

When this exception occurred, 90% of system memory was free. After this exception the process is still up, but the worker is disassociated from the master and is not processing anything.

Now, as per https://stackoverflow.com/a/48743757, what I understand is that the worker process was facing an OutOfMemory issue due to repeated submission of the executor. At this point some process sent SIGTERM to the worker JVM, and while handling it the JVM faced another OutOfMemory issue.

  1. Which process could have sent SIGTERM?
  2. Since there was enough system memory available, why did the OS (or whichever process) send the signal? Shouldn't the JVM exit by itself in case of an OutOfMemory issue?
  3. When the JVM was handling SIGTERM, why did an OutOfMemoryError occur?
  4. Why is the process still up?
Calypso

2 Answers


OP confirmed that it isn't Kubernetes but a VM.

Total RAM on worker node = 100GB

The worker node is the VM.

Have you tried upping the max heap size to -Xmx3000m?
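
For what it's worth, in a standalone deployment the worker daemon's heap is normally set via SPARK_DAEMON_MEMORY in conf/spark-env.sh (default 1g) rather than a hand-edited -Xmx. As a quick sanity check, a snippet like the following (a standalone illustration, not Spark code) prints the heap ceiling a JVM actually received, so you can confirm the new setting was picked up:

    // Standalone illustration (not Spark code): print the heap ceiling the JVM
    // actually received, to verify that a larger -Xmx / SPARK_DAEMON_MEMORY
    // setting reached the worker process.
    public class HeapInfo {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.printf("max heap = %d MiB%n", maxBytes / (1024 * 1024));
        }
    }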

Bill Mair
  • No, it's not Kubernetes. It's a VM. – Calypso Mar 04 '23 at 18:41
  • Maybe `-Xmx1000m` isn't enough then :-) – Bill Mair Mar 04 '23 at 18:50
  • Yes, it seems so, because the executor process which the worker is submitting keeps failing, and this might lead to growing heap usage. But which process is sending SIGTERM, and why is the JVM not exiting? Why does it face OutOfMemory again while handling SIGTERM? Is it possible that it requires some heap space to handle SIGTERM too? – Calypso Mar 04 '23 at 18:53
  • My main concern is finding the root cause of why the worker process is still up but not processing anything and not sending heartbeats to the master. Why is the JVM stuck? Normally when OOM occurs, the JVM just exits. – Calypso Mar 04 '23 at 18:57
  • Saw the edit now. The VM has 100GB memory. – Calypso Mar 04 '23 at 18:58
  • No, I didn't try increasing heap memory. – Calypso Mar 04 '23 at 19:44
  • Increasing heap memory might mitigate the problem, but it won't reveal the root cause: which process sent SIGTERM, why did the JVM get stuck, etc. On trying to reproduce it with the 1GB heap size again, this time it called the shutdown hook manager and logged RECEIVED SIGNAL TERM. So the difference this time is that when heap usage grew and out of memory occurred, the worker was still able to handle the SIGTERM gracefully instead of throwing an error while handling it. – Calypso Mar 04 '23 at 19:53
  • I think you have to look at your system journal and find out if the image that you have selected has process limits at the operating system level that you are violating, and adjust them accordingly. The Linux OOM killer doesn't just go around randomly killing processes. How much data are you trying to process? Does it even realistically fit into 100GB? Is that where your -Xmx limit has to be? – Bill Mair Mar 04 '23 at 22:33

1. Which process could have sent SIGTERM?

Theoretically any process within user scope can send the signal, that is, a process running as the same user as the JVM or as root. It could be a daemon you use to watch over the responsiveness of the JVM (something like Apache Commons Daemon). How to trace where a signal came from is probably better asked on ServerFault or SuperUser, but there have been questions like How can I tell in Linux which process sent my process a signal.

2. Since there was enough system memory available, why did the OS (or whichever process) send the signal? Shouldn't the JVM exit by itself in case of an OutOfMemory issue?

We cannot tell why a SIGTERM was sent. But that signal is a request to shut down: the JVM responds by running its shutdown hooks and exiting on its own, something that is difficult if it has already run into OutOfMemory. The next thing you would usually send is SIGKILL, which tells the OS to remove the process.
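
To make that SIGTERM path concrete, here is a minimal, self-contained sketch of the mechanism: on SIGTERM the JVM runs the registered shutdown hooks and then exits. The Spark worker registers its own hooks; this example is not Spark code, just an illustration of the JVM behaviour:

    // Minimal sketch of JVM shutdown-hook behaviour (not Spark code).
    public class ShutdownHookDemo {
        public static void main(String[] args) throws InterruptedException {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                // Runs on SIGTERM (kill <pid>) or a normal exit, but NOT on SIGKILL.
                // The hook itself may allocate objects, so it needs a working heap.
                System.out.println("shutdown hook running, cleaning up...");
            }));
            System.out.println("running; send SIGTERM (kill <pid>) to see the hook fire");
            Thread.sleep(60_000); // keep the process alive long enough to signal it
        }
    }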

You can trigger that SIGKILL automatically by following https://serverfault.com/a/826779

3. When the JVM was handling SIGTERM, why did an OutOfMemoryError occur?

The exact reason is to be found in the application you are running, combined with the events and data it needs to process. For server applications it is common to run shutdown code when they are going down.

Often the JVM, after hitting an OutOfMemoryError, is in an invalid state. The reason is that some object could no longer be allocated, and while throwing the Error (and processing it in various catch clauses, logging it to disk, etc.) it again needs to allocate memory, which can easily fail as well.
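
A hedged way to reproduce this failure mode in isolation (run with a tiny heap such as -Xmx16m; this is not Spark code): once the heap is genuinely full, even the catch/logging path needs allocations and can throw a second OutOfMemoryError, which is essentially what happened while the worker was handling SIGTERM.

    import java.util.ArrayList;
    import java.util.List;

    // Run with e.g. -Xmx16m. Illustrates how handling the first OutOfMemoryError
    // can itself fail, because logging/string building also allocates.
    public class OomDuringHandling {
        public static void main(String[] args) {
            List<byte[]> hog = new ArrayList<>();
            try {
                while (true) {
                    hog.add(new byte[1024 * 1024]); // fill the heap 1 MiB at a time
                }
            } catch (OutOfMemoryError first) {
                // 'hog' is still reachable, so the heap is still full; the string
                // concatenation below allocates and may throw a second OOM.
                System.err.println("caught first OOM after " + hog.size() + " MiB");
            }
        }
    }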

4. Why is the process still up?

The request to shut down was not successfully processed by the JVM, as it is in an unhealthy state. So the process did not terminate itself, and the OS has no reason to force termination.

Next Steps

For operations it may be suitable to increase the total JVM memory and perform sanity reboots. The tricky part is that the code that needs memory is often not the culprit that failed to release memory in time, so the error's stack trace may be misleading. You will have to analyze the memory usage, which works with a heap dump taken via the -XX:+HeapDumpOnOutOfMemoryError parameter (see Using HeapDumpOnOutOfMemoryError parameter for heap dump for JBoss). Also consider activating an automatic restart on OutOfMemoryError as described in https://serverfault.com/questions/826735/how-to-restart-java-app-managed-by-systemd-on-outofmemory-errors
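
Besides the -XX:+HeapDumpOnOutOfMemoryError flag, a heap dump can also be taken on demand from a running HotSpot JVM via the platform diagnostic MXBean. A hedged sketch (the output path is just an example); the resulting .hprof file can then be inspected with tools such as Eclipse MAT or VisualVM:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    // Sketch: take an .hprof heap dump of the current JVM on demand (HotSpot only).
    public class HeapDumper {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // Second argument = true: dump only live (reachable) objects.
            bean.dumpHeap("/tmp/worker-heap.hprof", true);
        }
    }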

For developers it may make sense to reduce the total JVM memory to reproduce the situation easily, and to analyze the logs and heap dumps from operations.

Queeg