
I am trying to understand RDD partitioning logic. An RDD is partitioned across nodes, but I want to understand how this partitioning logic works.

I have a VM with 4 cores assigned to it. I created two RDDs, one from a file on HDFS and one from a parallelize operation.

[screenshot of the spark-shell session showing the two RDDs and their partition counts]

The first operation created two partitions, but the second created four.
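
For reference, here is a minimal sketch of what that spark-shell session looks like (the HDFS path and variable names are placeholders, assuming a 4-core local master such as local[4]):

// RDD backed by a small file on HDFS (path is illustrative)
val fileRdd = sc.textFile("hdfs:///tmp/sample.txt")
fileRdd.getNumPartitions   // 2

// RDD built from an in-memory collection
val parRdd = sc.parallelize(1 to 100)
parRdd.getNumPartitions    // 4, one per core in local mode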

I checked the number of blocks allocated to the file: it is 1 block, since the file is very small, yet the RDD created on that file shows two partitions. Why is this? I read somewhere that partitioning also depends on the number of cores, which is 4 in my case, but that still does not explain this output.

Can someone help me understand this?

Shashi
    Possible duplicate of [How does partitioning work for data from files on HDFS?](http://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs) – sgvd May 22 '16 at 12:53

1 Answer


The full signature of textFile is:

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

With the second argument, minPartitions, you can set the minimum number of partitions you want to get. As you can see, by default it is set to defaultMinPartitions, which in turn is defined as:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

The value of defaultParallelism is configured with the spark.default.parallelism setting, which by default equals the number of cores when running Spark in local mode. This is 4 in your case, so you get min(4, 2) = 2, which is why you get 2 partitions.
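
If you want a different split, you can pass the partition counts explicitly; a quick sketch (the path and values are only illustrative):

// Ask textFile for at least 4 input partitions instead of the default of 2
val rdd4 = sc.textFile("hdfs:///tmp/sample.txt", minPartitions = 4)
rdd4.getNumPartitions   // at least 4

// parallelize takes an explicit slice count as its second argument
val rdd8 = sc.parallelize(1 to 100, numSlices = 8)
rdd8.getNumPartitions   // 8

// Alternatively, set spark.default.parallelism (e.g. via SparkConf) and
// parallelize will use that value whenever no slice count is given.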

sgvd
  • great answers. How did you figure that "The value of defaultParalellism is configured with the spark.default.parallelism setting" ?? – human Aug 18 '17 at 06:12
  • Firstly by assuming the naming of the variable and the setting is not accidental ;) but by following the code: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2321 sets it from the taskscheduler, which here https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L517 gets it from the backend which here https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala#L144 gets it from the setting – sgvd Aug 18 '17 at 15:35