
I need to find the percentage of zeros across all columns in a PySpark dataframe. How do I find the count of zeros in each column of the dataframe?

P.S.: I have tried converting the dataframe into a pandas dataframe and using value_counts, but inspecting the result is not feasible for a large dataset.

  • Possible duplicate of: [Spark DataFrame: Computing row-wise mean (or any aggregate operation)](https://stackoverflow.com/questions/32670958/spark-dataframe-computing-row-wise-mean-or-any-aggregate-operation) and [Apply a transformation to multiple columns pyspark dataframe](https://stackoverflow.com/questions/48452076/apply-a-transformation-to-multiple-columns-pyspark-dataframe) – pault Aug 20 '18 at 13:18

2 Answers


"How to find the count of zero across each columns in the dataframe?"

First:

import pyspark.sql.functions as F

# For each column, count the rows where the value equals zero
df_zero = df.select([F.count(F.when(df[c] == 0, c)).alias(c) for c in df.columns])

Second: you can then view the counts (compared to .show(), this gives you a better view, and the speed is not much different):

df_zero.limit(2).toPandas().head()
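
Since the question also asks for percentages, here is a minimal sketch of one way to get them: divide each per-column zero count by the total row count (this assumes df is the original dataframe from the question):

# Sketch: percentage of zeros per column, assuming df is the original dataframe
total = df.count()
df_zero_pct = df.select(
    [(F.count(F.when(df[c] == 0, c)) / total * 100).alias(c) for c in df.columns]
)
df_zero_pct.toPandas().head()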

Enjoy! :)

Victor Z

Use this code to find the number of zeros in a single column of a table.

Just replace Tablename and "column name" with the appropriate values:

from pyspark.sql.functions import col

Tablename.filter(col("column name") == 0).count()
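
For example, a minimal runnable sketch (the example data and the sales column are made-up placeholders, not from the original answer):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
# Hypothetical example data: three rows, two of which are zero
df = spark.createDataFrame([(0,), (5,), (0,)], ["sales"])
df.filter(col("sales") == 0).count()  # returns 2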
Henry Ecker