I have been running a workflow on roughly 3 million records × 15 columns (all strings) on my 4-core, 16 GB machine, using PySpark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting Spark, memory runs out and I get out-of-memory exceptions.
Since all my caches add up to about 1 GB, I thought the problem lay in garbage collection. I can run the Python garbage collector manually by calling:
import gc
collected = gc.collect()
print "Garbage collector: collected %d objects." % collected
This has helped a little.
I have played with Spark's GC settings according to this article, and I have tried compressing the RDDs and switching the serializer to Kryo. This slowed down the processing and did not help much with the memory.
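For context, the setup I am tuning looks roughly like this (values are illustrative, not my exact configuration):

from pyspark import SparkConf, SparkContext

# Illustrative values only -- roughly the knobs I experimented with.
conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("my-workflow")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.rdd.compress", "true"))

sc = SparkContext(conf=conf)

# The GC flags themselves have to go on the driver JVM, which in local mode
# means passing them at launch time, e.g.:
#   spark-submit --driver-java-options "-XX:+UseG1GC -XX:+PrintGCDetails" my_workflow.py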
Since I know exactly when I have spare CPU cycles to call the GC, it would help my situation to know how to call it manually in the JVM.
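The closest thing I have found is going through the Py4J gateway behind the SparkContext, but I am not sure this is the right (or supported) way to do it:

# sc._jvm is an internal attribute, so this may not be a supported approach,
# and in local mode it would only trigger a GC in the single driver JVM.
sc._jvm.System.gc()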