
We are using the most recent Spark build. As input we have a very large list of tuples (800 million). We run our PySpark program using Docker containers with a master and multiple worker nodes. A driver is used to run the program and connect to the master.

When running the program, at the line sc.parallelize(tuplelist) the program either quits with a Java heap space error or exits without any error message at all. We use neither a Hadoop HDFS layer nor YARN.
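For illustration, a minimal sketch of the step where the program fails (the variable names and the tiny placeholder list are illustrative only, not the actual data):

    # Sketch of the failing step; tuplelist stands in for the ~800 million tuples
    # that the real program builds on the driver before handing them to Spark.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("parallelize-large-list")
    sc = SparkContext(conf=conf)

    tuplelist = [(i, i * 2) for i in range(1000)]  # placeholder instead of 800 million tuples

    rdd = sc.parallelize(tuplelist)  # <- the program dies here with a Java heap error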

So far we have considered the possible factors mentioned in these SO postings:

At this point we have the following questions:

  • How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here?
  • Do you know any (common?) mistake which may lead to the observed behavior?
Daniyal
  • I personally use Scala with Spark, but I was at a DataWorks convention where they talked about how the main issue with PySpark is that Python memory lives in the Java heap, if I remember correctly, and when you use Python you should be aware of your memory consumption and set the configurations correctly. I'm not sure, but to me it seems the error isn't due to the number of partitions or anything like that; the source of the error is in the configurations you use. – Ilya Brodezki Jul 15 '19 at 11:58

1 Answer

How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here?

Ans: There are multiple factors to consider when deciding the number of partitions.

1) In many cases, using 3-4x as many partitions as you have cores works well, assuming each partition takes more than a few seconds to process (see the sketch after this list).

2) Partitions shouldn't be too small or too large; around 128 MB to 256 MB per partition is usually good enough.
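For point 1), a sketch of how the partition count could be derived from the available cores and passed to parallelize (the multiplier of 3 and the use of sc.defaultParallelism are illustrative assumptions, not settings from your job):

    # Sketch: derive the partition count from the core count (assumed multiplier of 3).
    # sc.defaultParallelism typically equals the total number of executor cores.
    num_partitions = sc.defaultParallelism * 3

    # Pass the partition count to parallelize explicitly instead of relying on the default.
    rdd = sc.parallelize(tuplelist, numSlices=num_partitions)

    print(rdd.getNumPartitions())  # verify how the data was actually split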

Do you know any (common?) mistake which may lead to the observed behavior?

Can you check the executor memory and the disk space available to run the job?

If you can specify more details about the job, e.g. number of cores, executor memory, number of executors and available disk, it will be helpful for pinpointing the issue.
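For reference, a sketch of how those settings can be read back from a running PySpark session (the property keys are standard Spark configuration names; the "not set" fallback is just illustrative):

    # Sketch: print the memory/core settings the job is actually running with.
    for key in ("spark.driver.memory",
                "spark.executor.memory",
                "spark.executor.cores",
                "spark.executor.instances"):
        print(key, "=", sc.getConf().get(key, "not set"))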

data_addict
  • We run our Spark jobs with the following configuration: /spark/bin/spark-submit --master spark://spark-master:7077 --driver-memory 5g --executor-memory 7g --py-files path/to/file.py 2 20 0.5 "./data/40k_test.csv" So the driver memory is 5 GB and the executor memory 7 GB. The workers have 2 CPUs and 2 GB of memory reservation and mem_limit in the docker-compose.yml. Thanks for your help! – Daniyal Jul 15 '19 at 13:56
  • I don't see any issue with the memory allocation. Having a look at the code, the kind of machines (standard-2 or standard-4.....) and the YAML file might give a better idea of the issue. – data_addict Jul 16 '19 at 05:42