
There are quite a lot of posts about how to partition a DataFrame/RDD to improve performance. My question is much simpler: what is the most direct way to show the partitioner of a DataFrame? Judging by the name, I guessed that df.rdd.partitioner would return the partitioner; however, it always returns None:

df = spark.createDataFrame([("A", 1), ("B", 2), ("A", 3), ("C", 1)], ['k', 'v']).repartition("k")

df.rdd.partitioner  # None

One way I have found to see the partitioner is to read the output of df.explain(). However, this prints quite a lot of other information (the whole physical plan). Is there a more direct way to show just the partitioner of a DataFrame/RDD?
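
For reference, df.explain() on the DataFrame above prints something along these lines (the exact expression IDs and the partition count, 200 by default, vary with the Spark version and configuration):

== Physical Plan ==
Exchange hashpartitioning(k#0, 200)
+- Scan ExistingRDD[k#0,v#1L]

The hashpartitioning(k#0, 200) part is the only piece I am actually after.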


1 Answer


As suggested in the comment above (by mayank agrawal), we can use the queryExecution object to get some insight.

If we do not have a table, we can use:

# the full executed physical plan as JSON (verbose, but the partitioning is in there)
df._jdf.queryExecution().executedPlan().prettyJson()

# just the output partitioning of the planned query (for this df, a HashPartitioning on k)
df._jdf.queryExecution().sparkPlan().outputPartitioning().prettyJson()

whichever fits our goal.
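
To get just the partitioner as a one-liner, here is a minimal sketch (show_partitioning is my own helper name, not a Spark API); any JVM object reached through py4j supports toString(), so this prints e.g. hashpartitioning(k#0, 200) for the DataFrame above:

def show_partitioning(df):
    # outputPartitioning() describes how the planned query distributes rows
    print(df._jdf.queryExecution().sparkPlan().outputPartitioning().toString())

show_partitioning(df)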

Or, if we have a Hive table, we can look up its bucketing columns through the catalog:

# this assumes df was read directly from a table, so that the unanalyzed
# logical plan is a relation that exposes tableName() (e.g. "db.tbl")
table = df._jdf.queryExecution().logical().tableName()

# spark.catalog is the session's Catalog instance, so there is no need
# to construct one by hand
db_name, table_name = table.split(".")
for col in spark.catalog.listColumns(table_name, db_name):
    if col.isBucket:
        print(f"bucketed by {col.name}")