
I have been running a workflow on about 3 million records x 15 columns (all strings) on my 4-core, 16 GB machine using PySpark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting Spark, memory runs out and I get Out of Memory exceptions.

Since all my caches sum up to about 1 GB I thought that the problem lies in the garbage collection. I was able to run the python garbage collector manually by calling:

import gc
collected = gc.collect()
print("Garbage collector: collected %d objects." % collected)

This has helped a little.

I have played with Spark's GC settings according to this article, and have tried compressing the RDD and changing the serializer to Kryo. This slowed down the processing and did not help much with the memory.

Since I know exactly when I have spare CPU cycles to call the GC, it would help my situation to know how to call it manually in the JVM.

architectonic

3 Answers


I believe this will trigger a GC (hint) in the JVM:

spark.sparkContext._jvm.System.gc()

See also: How to force garbage collection in Java?

and: Java: How do you really force a GC using JVMTI's ForceGarbageCollection?
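Combining this with the Python-side `gc.collect()` from the question, a minimal helper can be sketched. Note this is only a sketch: `_jvm` is an internal py4j gateway attribute rather than a stable public API, `System.gc()` is merely a hint the JVM may ignore, and `spark` is assumed to be an active SparkSession.

```python
import gc

def request_full_gc(spark):
    """Collect Python-side garbage, then hint the driver JVM to run a GC.

    `spark` is assumed to be an active SparkSession; `_jvm` is a py4j
    internal, so this may break across Spark versions.
    """
    collected = gc.collect()              # reclaim unreachable Python objects
    spark.sparkContext._jvm.System.gc()   # GC *hint* to the driver JVM only
    return collected
```

This only touches the driver JVM; executor JVMs in a cluster are not affected. In local mode, where driver and executors share one JVM, that distinction disappears.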

Paul Brannan

You never have to call the GC manually. If you got an OOMException, it's because there is no more memory available. You should look for a memory leak, i.e. references you keep in your code. If you release those references, the JVM will free space when needed.

crak
  • Yes, I have read this many times, but I still think that my case is eligible for manual GC, for several reasons: a. it was beneficial to call the Python GC, since it considers the number of garbage objects rather than their size; b. the nature of my application involves stages where no computation takes place while waiting for a user decision; and c. what if I need to run some memory-intensive Python functionality, or a completely different application? I doubt that the JVM GC would account for that – architectonic Nov 14 '15 at 07:34
  • If you have to run memory-intensive functionality on the JVM (I don't know about Python), the VM will use all the memory you allow it to use, and if it needs more it will crash (because the JVM respects your wish ;). Calling the GC when there is no computation going on can seem like a good idea, but that GC will be a full GC, and full GCs are very, very slow. In any case, if you encounter an Out of Memory exception, it's not a GC problem! It's a code problem: either a need for more memory, or a memory leak. – crak Nov 16 '15 at 17:03
  • Our experience is that we are getting OOMException when we _increase_ the size of the Java heap on the Driver/Master node. Our theory is that with the smaller size, GC is happening more frequently. – vy32 Jul 01 '19 at 21:13
  • It's strange behavior. But indeed, if you have less memory it will fill up quicker, so the GC will have to clean memory more frequently. Are you speaking about a JVM OOM? The driver memory should be kept low; the computation is done in the workers. If you have a worker on the same server as the driver, it's possible that increasing the driver's memory limits the memory accessible to the worker, leading to an OOM – crak Jul 02 '19 at 10:06

This is not yet possible; there are some open tickets about executing a "management task" on all executors, but none are completed yet.

You can try to call the JVM GC when executing worker code; this will work, for example while doing an RDD map. But I am sure that with the right tuning you can get rid of the OOM.

The most important setting is the fraction of the Java heap shared between execution and RDD cache memory, spark.memory.fraction: sometimes it's better to set it to a very low value (such as 0.1), sometimes to increase it.
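As a configuration sketch (the keys below belong to Spark's unified memory manager; the defaults noted in the comments are from the Spark docs, and the commented-out SparkConf usage assumes pyspark is installed):

```python
# Settings to experiment with; each would be applied via SparkConf.set(...)
memory_conf = {
    # fraction of (heap - 300 MB reserved) shared by execution and storage
    "spark.memory.fraction": "0.1",
    # share of that fraction protected from eviction for cached blocks
    "spark.memory.storageFraction": "0.5",
}

# Usage sketch, assuming pyspark is available:
# from pyspark import SparkConf, SparkContext
# conf = SparkConf()
# for key, value in memory_conf.items():
#     conf = conf.set(key, value)
# sc = SparkContext(conf=conf)
```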

More info at https://spark.apache.org/docs/2.2.0/tuning.html#memory-management-overview

Thomas Decaux