158

I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.

In Python, I can do this:

data.shape

Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one.

row_number = data.count()
column_number = len(data.dtypes)

The computation of the number of columns is not ideal...

ItsMe
Xi Liang

5 Answers

253

You can get its shape with:

print((df.count(), len(df.columns)))
yatu
George Fisher
80

Use df.count() to get the number of rows.
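
For completeness, a minimal sketch (assuming an existing DataFrame df) that pairs this with len(df.columns) to get the full shape:

n_rows = df.count()        # triggers a Spark job that scans the data
n_cols = len(df.columns)   # reads only the schema; no job is triggered
print((n_rows, n_cols))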

Tshilidzi Mudau
VMEscoli
55

Add this to your code:

import pyspark.sql.dataframe

def spark_shape(self):
    # return (row count, column count), like pandas' .shape
    return (self.count(), len(self.columns))

# attach the helper as a shape() method on every DataFrame
pyspark.sql.dataframe.DataFrame.shape = spark_shape

Then you can do

>>> df.shape()
(10000, 10)

But just a reminder that .count() can be very slow for very large tables that have not been persisted.
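
If shape() (and therefore .count()) is called repeatedly on the same data, one option is to cache the DataFrame first. A minimal sketch, assuming df fits in the available executor memory:

df.cache()         # mark df for caching in executor memory
df.count()         # the first action materializes the cache
print(df.shape())  # later calls reuse the cached data instead of rescanning the source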

Louis Yang
    I really think it's a bad idea to change the DataFrame API, without a valid reason to do so. just call `spark_shape(my_df)`... Moreover, possibly name the function something clearer like `compute_dataframe_shape`... – Davide Aug 10 '22 at 13:50
  • This tip is for data scientists or data analysts who have to type those commands many times a day when analyzing data. It is not about engineering best practices or production code. – Louis Yang Sep 05 '22 at 09:07
12

print((df.count(), len(df.columns)))

is easier for smaller datasets.

However, if the dataset is huge, an alternative approach is to use pandas and Arrow to convert the Spark DataFrame to a pandas DataFrame and call shape on it

# enable Arrow to speed up the Spark-to-pandas conversion
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
# toPandas() collects the whole dataset to the driver, which must have enough memory to hold it
print(df.toPandas().shape)
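
Note that on Spark 3.x the Arrow flag has been renamed, so a variant of the same idea (assuming a recent Spark version and a running SparkSession named spark) would be:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Spark 3.x name of the flag
print(df.toPandas().shape)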
Venzu251720
    Isn't .toPandas an action? Meaning: isn't this going to collect the data to your master, and then call shape on it? If so, it would be inadvisable to do that, unless you're sure it's going to fit in master's memory. – ponadto Apr 06 '20 at 16:57
  • If the dataset is huge, collecting to pandas is exactly what you do NOT want to do. Btw: why do you enable cross join for this? And does the Arrow configuration help when collecting to pandas? – Melkor.cz Aug 27 '20 at 11:49
4

I don't think there is a function similar to data.shape in Spark. But I would use len(data.columns) rather than len(data.dtypes).
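
To illustrate the difference (df here is any existing Spark DataFrame):

df.columns   # list of column names, e.g. ['id', 'name']
df.dtypes    # list of (name, type) pairs, e.g. [('id', 'bigint'), ('name', 'string')]
print((df.count(), len(df.columns)))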

YungChun