
I am trying to get the rows with null values from a PySpark DataFrame. In pandas, I can achieve this using isnull() on the dataframe:

df = df[df.isnull().any(axis=1)]

But in the case of PySpark, when I run the command below it throws an AttributeError:

df.filter(df.isNull())

AttributeError: 'DataFrame' object has no attribute 'isNull'.

How can I get the rows with null values without checking each column individually?

– dg S
  • see https://stackoverflow.com/questions/37262762/filter-pyspark-dataframe-column-with-none-value – Mikhail Berlinkov Nov 26 '18 at 18:33
  • @Mikhail I think the linked question is not the answer, because this question asks about all columns at the same time, without checking each column separately. – Amanda Dec 11 '18 at 09:26

2 Answers


You can filter the rows with where, reduce and a generator expression over the columns. For example, given the following dataframe:

# example dataframe with some null entries
df = sc.parallelize([
    (0.4, 0.3),
    (None, 0.11),
    (9.7, None),
    (None, None)
]).toDF(["A", "B"])

df.show()
+----+----+
|   A|   B|
+----+----+
| 0.4| 0.3|
|null|0.11|
| 9.7|null|
|null|null|
+----+----+

Rows that contain at least one null value can then be filtered with:

import pyspark.sql.functions as f
from functools import reduce

# OR together one isNull() check per column, then keep the matching rows
df.where(reduce(lambda x, y: x | y, (f.col(c).isNull() for c in df.columns))).show()

Which gives:

+----+----+
|   A|   B|
+----+----+
|null|0.11|
| 9.7|null|
|null|null|
+----+----+

In the condition you choose how the per-column checks are combined: or (|) keeps rows where any column is null, and (&) keeps rows where all columns are null, and so on.
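
For instance, a minimal sketch of the all-columns variant on the same dataframe, simply swapping | for &:

import pyspark.sql.functions as f
from functools import reduce

# keep only the rows where every column is null (& instead of |)
df.where(reduce(lambda x, y: x & y, (f.col(c).isNull() for c in df.columns))).show()

With the example data above, this keeps only the |null|null| row.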

– Amanda

This is how you can do it in Scala:

import org.apache.spark.sql.functions._
import spark.implicits._ // brings toDF() into scope (already imported in spark-shell / Databricks notebooks)

case class Test(id: Int, weight: Option[Int], age: Int, gender: Option[String])

val df1 = Seq(
  Test(1, Some(100), 23, Some("Male")),
  Test(2, None, 25, None),
  Test(3, None, 33, Some("Female"))
).toDF()

// build one isNull predicate per column and OR them together;
// display() is Databricks-specific, use .show() elsewhere
display(df1.filter(df1.columns.map(c => col(c).isNull).reduce((a, b) => a || b)))
– ravi malhotra