
Let's say we have the following data set:

columns = ['id', 'dogs', 'cats']
values = [(1, 2, 0),(2, None, None),(3, None,9)]
df = spark.createDataFrame(values,columns)
df.show()
+----+----+----+
|  id|dogs|cats|
+----+----+----+
|   1|   2|   0|
|   2|null|null|
|   3|null|   9|
+----+----+----+

I would like to calculate the number ("miss_nb") and percentage ("miss_pt") of columns with missing values per row, and get the following table:

+----+-------+-------+
|  id|miss_nb|miss_pt|
+----+-------+-------+
|   1|      0|   0.00|
|   2|      2|   0.67|
|   3|      1|   0.33|
+----+-------+-------+

The solution should work for any number of columns (no fixed list of column names).

How to do it?

Thanks!

Andrii
    Does this answer your question? [Is there a way to count non-null values per row in a spark df?](https://stackoverflow.com/questions/55527301/is-there-a-way-to-count-non-null-values-per-row-in-a-spark-df) – samkart Nov 04 '22 at 07:22
  • yes in some sense. thank you – Andrii Nov 04 '22 at 07:28

0 Answers