
I need to find the percentage of zeros across all columns in a PySpark dataframe. How do I find the count of zeros in each column of the dataframe?

P.S.: I have tried converting the dataframe into a pandas dataframe and using value_counts, but inspecting the result is not feasible for a large dataset.

  • Possible duplicate of: [Spark DataFrame: Computing row-wise mean (or any aggregate operation)](https://stackoverflow.com/questions/32670958/spark-dataframe-computing-row-wise-mean-or-any-aggregate-operation) and [Apply a transformation to multiple columns pyspark dataframe](https://stackoverflow.com/questions/48452076/apply-a-transformation-to-multiple-columns-pyspark-dataframe) – pault Aug 20 '18 at 13:18

2 Answers


"How to find the count of zero across each columns in the dataframe?"

First:

import pyspark.sql.functions as F

# For each column, count the rows where the value equals zero
df_zero = df.select([F.count(F.when(df[c] == 0, c)).alias(c) for c in df.columns])

Second: you can then view the counts (compared to .show(), this gives you a better view, and the speed is not much different):

df_zero.limit(2).toPandas().head()
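
Since the question also asks for percentages, here is a minimal sketch of one way to get them: divide each per-column zero count by the total row count (this assumes df is the original dataframe from the question):

# Sketch: percentage of zeros per column, assuming df is the original dataframe
total = df.count()
df_zero_pct = df.select(
    [(F.count(F.when(df[c] == 0, c)) / total * 100).alias(c) for c in df.columns]
)
df_zero_pct.toPandas().head()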

Enjoy! :)

Victor Z

Use this code to find the number of zeros in a single column of a table.

Just replace Tablename and "column name" with the appropriate values:

from pyspark.sql.functions import col

Tablename.filter(col("column name") == 0).count()
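
For example, a minimal runnable sketch (the example data and the sales column are made-up placeholders, not from the original answer):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
# Hypothetical example data: three rows, two of which are zero
df = spark.createDataFrame([(0,), (5,), (0,)], ["sales"])
df.filter(col("sales") == 0).count()  # returns 2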
Henry Ecker