There are quite a lot of posts about how to partition a DataFrame/RDD to improve performance. My question is much simpler: what is the most direct way to show the partitioner of a DataFrame? Judging by the name, I guessed that df.rdd.partitioner
would return the partitioner; however, it always returns None:
df = spark.createDataFrame([("A", 1), ("B", 2), ("A", 3), ("C", 1)], ['k', 'v']).repartition("k")
df.rdd.partitioner  # None
One way I found to see the partitioning is to read the output of df.explain().
However, this prints quite a lot of other information (the whole physical plan). Is there a more direct way to show just the partitioner of a DataFrame/RDD?