
I'm trying to understand how partitioning is done in Apache Spark. Can you guys help, please?

Here is the scenario:

  • a master and two nodes with 1 core each
  • a file, count.txt, that is 10 MB in size

How many partitions does the following create?

rdd = sc.textFile("count.txt")

Does the size of the file have any impact on the number of partitions?


1 Answer


By default a partition is created for each HDFS block, which is 64 MB by default (from the Spark Programming Guide).
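
A quick way to see what Spark actually did is to ask the RDD itself (a minimal sketch, assuming a pyspark shell where `sc` already exists and count.txt is reachable by the cluster):

    # the file name is the one from the question; the result depends on
    # the HDFS block count and on sc.defaultMinPartitions
    rdd = sc.textFile("count.txt")
    print(rdd.getNumPartitions())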

It's possible to pass a second parameter, minPartitions, which overrides the minimum number of partitions that Spark will create. If you don't override this value, Spark will create at least as many partitions as spark.default.parallelism.
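
For example, to force a higher minimum (a sketch; the value 8 is just a placeholder):

    # ask textFile for at least 8 input partitions; Spark may still create more,
    # but never fewer
    rdd = sc.textFile("count.txt", minPartitions=8)
    print(rdd.getNumPartitions())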

Since spark.default.parallelism is supposed to be the number of cores across all of the machines in your cluster, I believe there would be at least 3 partitions created in your case.
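
You can check both values from the shell (again assuming a running SparkContext called `sc`):

    print(sc.defaultParallelism)    # typically the total number of cores available to the application
    print(sc.defaultMinPartitions)  # the minimum that sc.textFile uses when minPartitions isn't given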

You can also repartition or coalesce an RDD to change the number of partitions, which in turn influences the total amount of available parallelism.
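
For example (a sketch with placeholder partition counts):

    # repartition() performs a full shuffle and can increase or decrease the partition count
    more_parts = rdd.repartition(8)

    # coalesce() avoids a full shuffle and is intended for reducing the partition count
    fewer_parts = rdd.coalesce(1)

    print(more_parts.getNumPartitions(), fewer_parts.getNumPartitions())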

– mrmcgreg
  • @jacek In the case of `default.parallelism` (3 partitions created) and the data file being 10 MB (a single block on HDFS), how much data will the Spark partitions contain? Will it be: 1. divided into 3 equal parts (3.3 MB each) and sent to the executors; 2. not divided (P1 = 10 MB, P2 = P3 = 0 MB) and executed on the same node because of data locality; or 3. a random shuffle of the data across all 3 partitions? – CᴴᴀZ Oct 26 '16 at 15:13
  • @mrmcgreg There is some confusion: in the first statement you said that by default a partition is created for each HDFS block, and then in the third statement you said that if we don't override `defaultMinPartitions` Spark will create at least as many partitions as `spark.default.parallelism`, which is supposed to be the number of cores across the cluster. So will the number of partitions equal the number of HDFS blocks, or the number of cores? – Explorer Mar 08 '17 at 16:42
  • @LiveAndLetLive I believe that these are all minimums. You'll have at least as many partitions as the smallest of the three values. – mrmcgreg Mar 08 '17 at 17:01
  • The information about HDFS seems to be outdated. The doc says "By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS)": http://spark.apache.org/docs/2.4.5/rdd-programming-guide.html – Dagang Jul 11 '20 at 17:20
  • `spark.default.parallelism` is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user: https://spark.apache.org/docs/2.4.5/configuration.html – Dagang Jul 11 '20 at 17:29
  • I have a bit of an issue with this old answer. What if you have a really big file with hundreds of blocks? Surely the point is the distinction between the number of partitions you can process concurrently and the overall number of partitions. Or am I missing something? You will often have fewer cores/executors than the overall number of partitions. – thebluephantom Feb 05 '21 at 12:48