8

My DataFrame looks like this:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen,

I need to remove row 2 because its FirstName is null.

I am using the following PySpark script:

join_Df1= Name.filter(Name.col(FirstName).isnotnull()).show()

I get the following error:

  File "D:\0\NameValidation.py", line 13, in <module>
join_Df1= filter(Name.FirstName.isnotnull()).show()

TypeError: 'Column' object is not callable

Can anyone please help me resolve this?

koiralo
Naveen Srikanth
  • Check out the answer https://stackoverflow.com/questions/37262762/filter-pyspark-dataframe-column-with-none-value – Dhruv Aggarwal Jun 23 '17 at 05:59
  • Possible duplicate of [Filter Pyspark dataframe column with None value](https://stackoverflow.com/questions/37262762/filter-pyspark-dataframe-column-with-none-value) – Jacek Laskowski Jun 25 '17 at 17:23

3 Answers

13

It looks like the FirstName column in your DataFrame contains empty strings rather than nulls. Here are some options to try:

df = sqlContext.createDataFrame([[1,'Navee','Srikanth'], [2,'','Srikanth'] , [3,'Naveen','']], ['ID','FirstName','LastName'])
df.show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName.isNotNull()).show()  # This doesn't remove row 2, because its FirstName is an empty string, not null
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  2|         |Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.filter(df.FirstName != '').show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+

df.where("FirstName != ''").show()
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
|  1|    Navee|Srikanth|
|  3|   Naveen|        |
+---+---------+--------+
Rakesh Kumar
7

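You should do it as follows, on the Name DataFrame from the question. Note that the method is isNotNull() (the name is case-sensitive) and that show() needs its parentheses:

Name.filter(Name.FirstName.isNotNull()).show()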

Hope this helps!

koiralo
-3

I think what you need is the pandas method notnull().

Suppose this is your input in the CSV file my_test.csv:

ID,FirstName,LastName
1,Navee,Srikanth
2,,Srikanth
3,Naveen

The code:

import pandas as pd
df = pd.read_csv("my_test.csv")

print(df[df['FirstName'].notnull()])

Output:

   ID FirstName  LastName
0   1     Navee  Srikanth
2   3    Naveen       NaN

This is the expression you want: df[df['FirstName'].notnull()]

Output of df['FirstName'].notnull():

0     True
1    False
2     True

Indexing df with this mask selects only the rows for which df['FirstName'].notnull() is True.

How is this checked? df['FirstName'].notnull() returns True where the FirstName value is not null and False where it is NaN.
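
The same steps can be tried without creating a file. This is a minimal sketch that assumes pandas is installed and operates on a pandas DataFrame, not a Spark one. Reading from io.StringIO and the names csv_text, mask and kept are illustrative choices, and notna() is used as the alias of notnull().

```python
import io

import pandas as pd

# The sample rows from the question, held in memory instead of my_test.csv.
csv_text = "ID,FirstName,LastName\n1,Navee,Srikanth\n2,,Srikanth\n3,Naveen,\n"

# By default read_csv turns empty fields into NaN.
df = pd.read_csv(io.StringIO(csv_text))

# Boolean mask: True where FirstName has a value, False where it is NaN.
mask = df["FirstName"].notna()

# Indexing with the mask keeps only the rows where it is True.
kept = df[mask]
print(kept)
```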

void