
There are quite a lot of posts about how to partition a DataFrame/RDD to improve performance. My question is much simpler: what is the most direct way to show the partitioner of a DataFrame? Judging by the name, I guessed that df.rdd.partitioner would return the partitioner; however, it always returns None:

df = spark.createDataFrame([("A", 1), ("B", 2), ("A", 3), ("C", 1)], ['k', 'v']).repartition("k")

df.rdd.partitioner  # None

One way I have found to see the partitioner is to read the output of df.explain(). However, this prints quite a lot of other information (the whole physical plan). Is there a more direct way to show just the partitioner of a DataFrame/RDD?
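
For reference, df.explain() on the DataFrame above prints something along these lines (the exact expression IDs and the partition count, 200 by default, vary with the Spark version and configuration):

== Physical Plan ==
Exchange hashpartitioning(k#0, 200)
+- Scan ExistingRDD[k#0,v#1L]

The hashpartitioning(k#0, 200) part is the only piece I am actually after.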


1 Answer


As suggested in the comment above (by mayank agrawal), we can use the queryExecution object to get some insight.

If we do not have a table, we can use:

# the full executed physical plan as JSON (verbose, but the partitioning is in there)
df._jdf.queryExecution().executedPlan().prettyJson()

# just the output partitioning of the planned query (for this df, a HashPartitioning on k)
df._jdf.queryExecution().sparkPlan().outputPartitioning().prettyJson()

whichever fits our goal.
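
To get just the partitioner as a one-liner, here is a minimal sketch (show_partitioning is my own helper name, not a Spark API); any JVM object reached through py4j supports toString(), so this prints e.g. hashpartitioning(k#0, 200) for the DataFrame above:

def show_partitioning(df):
    # outputPartitioning() describes how the planned query distributes rows
    print(df._jdf.queryExecution().sparkPlan().outputPartitioning().toString())

show_partitioning(df)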

Or, if we have a Hive table, we can look up its bucketing columns through the catalog:

# this assumes df was read directly from a table, so that the unanalyzed
# logical plan is a relation that exposes tableName() (e.g. "db.tbl")
table = df._jdf.queryExecution().logical().tableName()

# spark.catalog is the session's Catalog instance, so there is no need
# to construct one by hand
db_name, table_name = table.split(".")
for col in spark.catalog.listColumns(table_name, db_name):
    if col.isBucket:
        print(f"bucketed by {col.name}")