
Can anyone explain how the number of partitions is determined when a Spark DataFrame is created?

I know that for an RDD we can specify the number of partitions at creation time, like below.

val RDD1 = sc.textFile("path", 6)

But for a Spark DataFrame there does not seem to be an option to specify the number of partitions at creation time, the way there is for an RDD.

The only possibility I can think of is to use the repartition API after the DataFrame has been created.

df.repartition(4)
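For example, this is how I check the partition count before and after (assuming df is the DataFrame I already created):

println(df.rdd.getNumPartitions)
val df2 = df.repartition(4)
println(df2.rdd.getNumPartitions) // 4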

So can anyone please let me know whether it is possible to specify the number of partitions while creating a DataFrame?

Ramesh

2 Answers


You cannot, or at least not in the general case, but it is not that different from an RDD. For example, the textFile code you've provided only sets the minimum number of partitions.

In general:

  • Datasets generated locally, using methods like range or toDF on a local collection, will use spark.default.parallelism.
  • Datasets created from an RDD inherit the number of partitions from the parent RDD.
  • Datasets created using the data source API get their number of partitions from the underlying source and its split configuration.

  • Some data sources may provide additional options which give more control over partitioning. For example, the JDBC source allows you to set the partitioning column, value range, and desired number of partitions (see the sketch below).
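A rough sketch of the points above (not authoritative; it assumes a SparkSession named spark, and the JDBC connection details are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-demo").getOrCreate()
import spark.implicits._

// Locally generated Dataset: partition count follows spark.default.parallelism
println(spark.range(0, 1000).rdd.getNumPartitions)

// Dataset built from an RDD: inherits the parent's 6 partitions
val fromRdd = spark.sparkContext.parallelize(1 to 100, 6).toDF("n")
println(fromRdd.rdd.getNumPartitions) // 6

// JDBC source: partitioning column, value range and partition count set explicitly
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db") // placeholder connection details
  .option("dbtable", "some_table")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()
println(jdbcDF.rdd.getNumPartitions) // 8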
zero323

The default number of shuffle partitions for a Spark DataFrame is 200 (controlled by spark.sql.shuffle.partitions).

The default number of partitions for an RDD is governed by spark.default.parallelism, so it depends on the cluster configuration (10 in my environment).
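A quick way to check these defaults (a sketch, assuming an active SparkSession spark and its SparkContext sc):

println(spark.conf.get("spark.sql.shuffle.partitions")) // "200" unless overridden
println(sc.defaultParallelism) // depends on the master / core count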

Shyam Gupta