0

How many partitions will Spark create on a 10-node cluster with 20 executors when the code reads a folder containing 100 files?

  • Possible duplicate of [How does partitioning work in Spark?](http://stackoverflow.com/questions/26368362/how-does-partitioning-work-in-spark) –  Nov 29 '16 at 14:43

3 Answers

1

It depends on the mode you are running in, and you can tune it with the spark.default.parallelism setting. From the Spark documentation:

For operations like parallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

Link to related Documentation: http://spark.apache.org/docs/latest/configuration.html#execution-behavior
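As a rough illustration of the rule quoted above, here is a minimal Python sketch that models the documented default. It does not call Spark. The function name `default_parallelism`, the mode strings, and the figure of 4 cores per executor are assumptions made for the example; the question does not say how many cores each executor has.

```python
def default_parallelism(mode: str, total_cores: int) -> int:
    """Model of the documented default for spark.default.parallelism
    when an RDD has no parent (e.g. parallelize). Not a Spark API."""
    if mode == "local":
        return total_cores          # number of cores on the local machine
    if mode == "mesos-fine-grained":
        return 8                    # fixed value given in the documentation
    return max(total_cores, 2)      # others: total executor cores or 2


# Hypothetical cluster: 20 executors with 4 cores each (assumed value).
executors = 20
cores_per_executor = 4
partitions = default_parallelism("yarn", executors * cores_per_executor)
print(partitions)  # 80
```

Note that this default applies to operations such as parallelize. Reading files follows a different rule based on input splits.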

You can also set the number of partitions yourself, depending on the data you are reading. Some Spark APIs accept an additional argument for the number of partitions.

To check how many partitions are actually created, do as @Sandeep Purohit says:

rdd.getNumPartitions

This returns the number of partitions that were created.

You can also change the number of partitions after the RDD is created, using two APIs: coalesce and repartition.

Link to Coalesce and Repartition : Spark - repartition() vs coalesce()
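To make the difference between the two concrete, the following Python sketch models only the resulting partition count in the typical case. It does not call Spark. The helper names `after_coalesce` and `after_repartition` are hypothetical. It relies on the documented behaviour that coalesce without a shuffle cannot increase the number of partitions, while repartition always shuffles.

```python
def after_coalesce(current: int, requested: int, shuffle: bool = False) -> int:
    """Partition count after coalesce(requested) in the typical case."""
    if shuffle:
        return requested            # with a shuffle, any count can be reached
    return min(current, requested)  # without one, the count can only shrink


def after_repartition(current: int, requested: int) -> int:
    """repartition(n) behaves like coalesce(n, shuffle=True)."""
    return after_coalesce(current, requested, shuffle=True)


print(after_coalesce(100, 20))      # 20
print(after_coalesce(100, 200))     # 100
print(after_repartition(100, 200))  # 200
```

In practice this means coalesce is the cheaper choice for reducing partitions, and repartition is needed to increase them.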

Shivansh
0

From the Spark documentation:

By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.

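The number of partitions also depends on the size of each file: a file larger than one block is split into several partitions. If the files are very large, you may choose to request more partitions. Note that the 64MB figure applies to older HDFS versions; Hadoop 2.x and later default to 128MB blocks.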
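Applied to the question's folder of 100 files, the rule can be sketched as simple arithmetic. This is a simplified estimate in plain Python, not what Spark executes: it assumes one partition per block and at least one per file, and it ignores the minPartitions argument, the split slack Hadoop allows, and non-splittable compressed files. The function name `estimate_partitions` and the file sizes are made up for illustration.

```python
import math

MB = 1024 * 1024


def estimate_partitions(file_sizes, block_size=128 * MB):
    """Rough estimate: one partition per block, at least one per file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


# 100 small files, each below one block: one partition per file.
small_files = [10 * MB] * 100
print(estimate_partitions(small_files))  # 100

# 100 files of 300MB each: three 128MB blocks per file.
large_files = [300 * MB] * 100
print(estimate_partitions(large_files))  # 300
```

So for 100 small files the estimate is about 100 partitions regardless of the number of nodes or executors. The cluster size affects how many of those partitions are processed in parallel, not how many are created.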

0

For an RDD created from Scala/Java objects, the number of partitions depends on the number of cores on the machines. For an RDD created from Hadoop input files, it depends on the HDFS block size (which is version dependent). You can find the number of partitions in an RDD as follows:

rdd.getNumPartitions

Sandeep Purohit