
Using df.rdd.getNumPartitions(), we can get the number of partitions. But how do we get the partitions themselves?

I also tried to pick something up from the documentation and from all the attributes of a dataframe (using dir(df)). However, I could not find any API that would give the partitions; repartition, coalesce, and getNumPartitions were all I could find.

I read this and deduced that Spark does not know the partitioning key(s). My doubt is: if it does not know the partitioning key(s), and hence does not know the partitions, how can it know their count? And if it can, how do I determine the partitions?
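A minimal sketch of what I tried (df here stands for any DataFrame loaded through an existing SparkSession; the count shown is only illustrative):

df.rdd.getNumPartitions()   # gives the count of partitions, e.g. 8
dir(df)                     # lists attributes, but nothing that returns the partitions themselves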

Aviral Srivastava
  • What do you mean by determining the partitions? Content, method? Partition number? – thebluephantom Jun 17 '20 at 18:10
  • I am sorry for not being able to provide you the exact answer. By 'partition', I mean the folders in which the data gets partitioned. For example, 12 created_month folders, each containing roughly 30 created_day folders, and then files stored in each such `day` folder. Does that make sense? – Aviral Srivastava Jun 17 '20 at 18:17
  • No, as a dataframe is in memory and may be evicted to disk, but it sounds like you mean when the dataframe is written to disk. That is data at rest, and indeed different. – thebluephantom Jun 17 '20 at 18:20
  • I am confused about the data you read and the outcome you expect. Can you clarify a bit? You can get the partitions using the partitions() method, as in: https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#partitions-- – jgp Jun 17 '20 at 18:21
  • @jgp When I typed `df.rdd.partitions`, I got: `AttributeError: 'RDD' object has no attribute 'partitions'` – Aviral Srivastava Jun 17 '20 at 18:45

2 Answers


How about checking what each partition contains, using mapPartitionsWithIndex?

This code will work for small datasets:

def f(splitIndex, elements):
    # One record per partition: (partition index, its elements joined into a string)
    elements_text = ",".join(str(e) for e in elements)
    yield splitIndex, elements_text

rdd.mapPartitionsWithIndex(f).take(10)
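For example, reusing f above on a toy RDD built in a local session (the data and the 3-way split below are only illustrative; the exact split may differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 10 integers spread across 3 partitions, purely for illustration
rdd = spark.sparkContext.parallelize(range(10), numSlices=3)

rdd.mapPartitionsWithIndex(f).take(10)
# e.g. [(0, '0,1,2'), (1, '3,4,5'), (2, '6,7,8,9')]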
luk

pyspark provides the spark_partition_id() function.

spark_partition_id()

A column for partition ID.

Note: This is indeterministic because it depends on data partitioning and task scheduling.

>>> from pyspark.sql.functions import spark_partition_id
>>> spark.range(1, 1000000) \
...     .withColumn("spark_partition", spark_partition_id()) \
...     .groupby("spark_partition") \
...     .count().show(truncate=False)
+---------------+------+
|spark_partition|count |
+---------------+------+
|1              |500000|
|0              |499999|
+---------------+------+

Partitions are numbered from zero to n-1 where n is the number you get from getNumPartitions().
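As a quick sanity check (assuming the same session as above; the actual number depends on your default parallelism, and it was 2 in the run above, matching partition IDs 0 and 1):

>>> spark.range(1, 1000000).rdd.getNumPartitions()
2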

Is that what you're after? Or did you actually mean Hive partitions?

mazaneicha