We are using the most recent Spark build. As input we have a very large list of tuples (800 million). We run our PySpark program in Docker containers with one master and multiple worker nodes; a separate driver runs the program and connects to the master.
When running the program, at the line sc.parallelize(tuplelist) it either quits with a Java heap space error or exits without any error message at all. We use neither a Hadoop HDFS layer nor YARN.
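For reference, here is a minimal sketch of the relevant part of our program. The master URL, app name, and the tuple-list construction are simplified placeholders; in the real program tuplelist is built from our input data:

```python
from pyspark import SparkConf, SparkContext

# Placeholder master URL and app name for our standalone cluster in Docker.
conf = SparkConf().setAppName("parallelize-test").setMaster("spark://spark-master:7077")
sc = SparkContext(conf=conf)

# Stand-in for the real data: in our program tuplelist holds ~800 million tuples.
tuplelist = [(i, str(i)) for i in range(1000)]

# This is the line where the job either dies with java.lang.OutOfMemoryError
# (Java heap space) or exits without any error message.
rdd = sc.parallelize(tuplelist)
print(rdd.count())
```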
So far we have considered the possible factors mentioned in these SO posts:
- Spark java.lang.OutOfMemoryError : Java Heap space
- Spark java.lang.OutOfMemoryError: Java heap space (the list of possible solutions by samthebest also did not help to solve the issue)
At this point we have the following questions:
- How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here? (See the sketch after this list for what we have tried so far.)
- Do you know of any (common?) mistakes that could lead to the observed behavior?
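Regarding the first question, the following (hedged) sketch continues from the snippet above and shows how we have been experimenting with an explicit partition count. The worker/core numbers and the factor of 3 are guesses loosely based on the "2-3 tasks per CPU core" suggestion in the Spark tuning guide, not values we have verified:

```python
# Placeholder cluster dimensions; adjust to the actual Docker setup.
num_workers = 4
cores_per_worker = 8

# Rough guess: ~3 partitions per available core. Whether this is appropriate
# for parallelizing 800 million tuples is exactly our question.
num_partitions = num_workers * cores_per_worker * 3

rdd = sc.parallelize(tuplelist, numSlices=num_partitions)
```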