Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (Spark 1.6) and didn't find a method for that, or did I just miss it? (In the case of JavaRDD there's a getNumPartitions() method.)
5 Answers
You need to call `getNumPartitions()` on the DataFrame's underlying RDD, e.g., `df.rdd.getNumPartitions()`. In the case of Scala, this is a parameterless method: `df.rdd.getNumPartitions`.
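A minimal spark-shell sketch (illustrative only; `toDF` comes from the shell's implicits, and the Java line is my assumption of the equivalent call):

val df = (1 to 100).toList.toDF("n")   // small example DataFrame
df.rdd.getNumPartitions                // Scala: parameterless, returns the partition count as an Int
// For the Java API, df.toJavaRDD().getNumPartitions() should be the equivalent.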

- minus the `()`, so not entirely correct - at least not in Scala – thebluephantom Aug 07 '18 at 19:36
- Does this cause a *conversion* (_expensive_) from `DF` to `RDD`? – WestCoastProjects Jan 16 '19 at 20:58
- This IS expensive – WestCoastProjects Jan 19 '19 at 06:34
- @javadba Do you have an answer that doesn't appeal to the RDD API? – user4601931 Jun 02 '19 at 02:21
- No I do not, and it is unfortunate that Spark does not manage the metadata better, along the lines of Hive. Your answer is correct, but so is my observation that this is costly. – WestCoastProjects Jun 02 '19 at 03:07
- And for Hive tables, does it make sense to convert to RDD to get the number of partitions? Assuming the same result as Hive's `show partitions tableName`, but with `rdd.getNumPartitions` giving better performance (and more elegant). – Peter Krauss Nov 24 '19 at 16:14
- Can we get the partition number in a map function? Such as `rdd.map{ r => this.partitionNum }`? – Shawn.X Jun 24 '21 at 09:33
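Regarding the last comment: one way to see which partition each record lives in (a sketch, not taken from the answers here) is `RDD.mapPartitionsWithIndex`, which passes the partition index into your function; `df` is any DataFrame:

// Tag every row with the index of the partition that holds it.
val rowsWithPartition = df.rdd.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(row => (idx, row))
}
rowsWithPartition.take(5)   // e.g. Array((0,[...]), (0,[...]), (1,[...]), ...)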
`dataframe.rdd.partitions.size` is another alternative, apart from `df.rdd.getNumPartitions()` or `df.rdd.partitions.length`.
Let me explain this with a full example...
val x = (1 to 10).toList
val numberDF = x.toDF("number")
numberDF.rdd.partitions.size // => 4
To prove how many partitions we got with the above, save that DataFrame as CSV:
numberDF.write.csv("/Users/Ram.Ghadiyaram/output/numbers")
Here is how the data is separated across the different partitions:
Partition 00000: 1, 2
Partition 00001: 3, 4, 5
Partition 00002: 6, 7
Partition 00003: 8, 9, 10
Update:
@Hemanth asked a good question in the comments... basically, why is the number of partitions 4 in the above case?
Short answer: it depends on where you are executing. Since I used local[4], I got 4 partitions.
Long answer:
I was running the above program on my local machine with the master set to local[4], so it resulted in 4 partitions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName(this.getClass.getName)
  .config("spark.master", "local[4]").getOrCreate()
With spark-shell on a YARN master, I got the number of partitions as 2.
Example: spark-shell --master yarn
Then I typed the same commands again:
scala> val x = (1 to 10).toList
x: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val numberDF = x.toDF("number")
numberDF: org.apache.spark.sql.DataFrame = [number: int]
scala> numberDF.rdd.partitions.size
res0: Int = 2
- Here 2 is the default parallelism of Spark.
- Based on the hash partitioner, Spark decides how many partitions to distribute the data into. When running with `--master local`, it is based on `Runtime.getRuntime.availableProcessors()`, i.e. `local[Runtime.getRuntime.availableProcessors()]`, and Spark will try to allocate that many partitions. If your available number of processors is 12 but your list only has the 10 elements 1 to 10, then only 10 partitions will be created.
NOTE:
If you are on a 12-core laptop (where I am executing the Spark program), then by default the number of partitions/tasks is the number of all available cores, i.e. 12. That means `local[*]` or `s"local[${Runtime.getRuntime.availableProcessors()}]"`, but in this case only 10 numbers are there, so it will be limited to 10.
Keeping all these pointers in mind, I would suggest you try it on your own.
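A quick way to cross-check the numbers above (a minimal sketch; spark is the SparkSession built in the snippet earlier):

Runtime.getRuntime.availableProcessors()   // cores visible to the JVM, e.g. 12 on a 12-core laptop
spark.sparkContext.defaultParallelism      // e.g. 4 with local[4]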

- Thanks for the great answer. I am curious why a list of 10 numbers was divided into 4 partitions when converted to a DF. Can you kindly provide some explanation, please? – Hemanth Jul 30 '19 at 17:47
- Is this `since local[4] I used, I got 4 partitions.` still valid for 3.x? I've got 200 partitions with local[4]. – Sergey Bushmanov Jun 19 '21 at 07:39
- @Sergey Bushmanov: [see here](https://stackoverflow.com/a/57385715/647053) and also the [spark docs](https://spark.apache.org/docs/latest/configuration.html#execution-behavior) – Ram Ghadiyaram Jun 20 '21 at 04:55
- The 2 links you provided indeed confirm that the current number of partitions differs from `local[n]`. Actually, that the number of partitions has little to do with `local[n]` is expected, due to the map/reduce parallelism. – Sergey Bushmanov Jun 20 '21 at 06:32
Convert to RDD, then get the partitions length:
DF.rdd.partitions.length

val df = Seq(
("A", 1), ("B", 2), ("A", 3), ("C", 1)
).toDF("k", "v")
df.rdd.getNumPartitions

- Please read this [how-to-answer](http://stackoverflow.com/help/how-to-answer) for providing a quality answer. – thewaywewere Apr 22 '17 at 17:35
One more interesting way to get the number of partitions is the `mapPartitions` transformation. Sample code:
val x = (1 to 10).toList
val numberDF = x.toDF()
numberDF.rdd.mapPartitions(x => Iterator[Int](1)).sum()   // emits a single 1 per partition, so the sum equals the partition count
Spark experts are welcome to comment on its performance.
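For comparison, a driver-side check (a sketch reusing numberDF from above): reading the partition metadata typically does not launch a Spark job, whereas the mapPartitions + sum approach above runs one.

numberDF.rdd.partitions.length   // same count, read from driver-side partition metadata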
