
My understanding was that Spark will choose the 'default' number of partitions based solely on the size of the file or, if it's a union of many parquet files, the number of parts.

However, when reading in a set of large parquet files, I see that the default # of partitions for an EMR cluster with a single d2.2xlarge is ~1200, whereas on a cluster of 2 r3.8xlarge I'm getting ~4700 default partitions.

What metrics does Spark use to determine the default partitions?

EMR 5.5.0
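
For reference, something along these lines will print both numbers I'm comparing (the S3 path is a placeholder, not the real bucket):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-check").getOrCreate()

    // Cluster-derived default: on EMR/YARN this tracks the cores available to executors
    println(s"defaultParallelism = ${spark.sparkContext.defaultParallelism}")

    // The partition count Spark actually chooses for the parquet scan
    val df = spark.read.parquet("s3://my-bucket/parquet-data/")  // placeholder path
    println(s"parquet scan partitions = ${df.rdd.getNumPartitions}")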

Carlos Bribiescas

2 Answers


spark.default.parallelism - Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

2X number of CPU cores available to YARN containers.

http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults

Looks like it matches non-EMR/AWS Spark as well.
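
A minimal sketch of what that setting affects, assuming a Spark 2.x session (the value 400 is purely illustrative, not a recommendation):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("explicit-parallelism")
      .config("spark.default.parallelism", "400")  // illustrative value only
      .getOrCreate()

    // RDD operations such as parallelize, join, and reduceByKey fall back to this
    // value when no partition count is passed explicitly.
    val rdd = spark.sparkContext.parallelize(1 to 1000000)
    println(rdd.getNumPartitions)  // expected to reflect spark.default.parallelism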

strongjz
  • I think that only applies when you do something like sc.parallelize(), not when you're reading from S3. Or, if a single file is large, it'll split that as well. – Carlos Bribiescas Aug 08 '17 at 13:56
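
To illustrate that distinction (assuming Spark 2.x, where file-based DataFrame reads are split by the SQL file-source settings such as spark.sql.files.maxPartitionBytes rather than by spark.default.parallelism alone):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext

    // RDD API: the partition count follows spark.default.parallelism
    println(sc.parallelize(1 to 100).getNumPartitions)

    // File-based DataFrame reads: split size is governed by the file-source settings
    println(spark.conf.get("spark.sql.files.maxPartitionBytes"))  // 128 MB by default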

I think there was some transient issue, because when I restarted that EMR cluster with the d2.2xlarge it gave me the number of partitions I expected, which matched the r3.8xlarge cluster: the number of files on S3.

If anyone knows why this kind of thing happens, though, I'll gladly mark yours as the answer.
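
For anyone who wants to reproduce the comparison, something like this (the path is a placeholder) shows whether the partition count lines up with the file count:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.parquet("s3://my-bucket/parquet-data/")  // placeholder path

    println(s"input files     = ${df.inputFiles.length}")
    println(s"scan partitions = ${df.rdd.getNumPartitions}")
    // If large files get split, or many small ones packed together, these diverge.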

Carlos Bribiescas