
The PySpark RDD documentation

http://spark.apache.org/docs/1.2.1/api/python/pyspark.html#pyspark.RDD

does not show any method(s) to display partition information for an RDD.

Is there any way to get that information without executing an additional step, e.g.:

myrdd.mapPartitions(lambda x: iter([1])).sum()  # one element per partition; the sum is the partition count

The above works, but it seems like extra effort.

WestCoastProjects

2 Answers


I missed it: very simple:

rdd.getNumPartitions()

Not used to the java-ish getFooMethod() anymore ;)

Update: adding in the comment from @dnlbrky:

dataFrame.rdd.getNumPartitions()
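
For illustration, a minimal self-contained sketch (assuming a local Spark install; the app name and partition count here are arbitrary):

from pyspark import SparkContext

sc = SparkContext("local[4]", "partitions-demo")

# Explicitly request 4 partitions when parallelizing.
rdd = sc.parallelize(range(100), 4)
print(rdd.getNumPartitions())  # -> 4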
WestCoastProjects
    DataFrames were introduced in Spark 1.3, and are often used in place of RDDs. For those reading this answer and trying to get the number of partitions for a DataFrame, you have to convert it to an RDD first: `myDataFrame.rdd.getNumPartitions()`. – dnlbrky Apr 20 '15 at 19:58

The OP didn't specify which information he wanted to get for the partitions (but seemed happy enough with the number of partitions).

If it is the number of elements in each partition that you are looking for (as was the case here), the following solution works fine: https://gist.github.com/venuktan/bd3a6b6b83bd6bc39c9ce5810607a798
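
For reference, a short sketch of that idea (the linked gist may differ in detail). `mapPartitionsWithIndex` emits one `(index, count)` pair per partition; `glom()` is an even shorter alternative if you only need the sizes:

# One (partition_index, element_count) pair per partition:
counts = rdd.mapPartitionsWithIndex(
    lambda idx, it: [(idx, sum(1 for _ in it))]
).collect()

# Shorter, sizes only (ordered by partition index). Note that glom()
# materializes each partition as a list, so avoid it on huge partitions:
sizes = rdd.glom().map(len).collect()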

martin_wun