
I have to apply a filter with multiple conditions combined with OR on a PySpark DataFrame.

I am trying to create a separate DataFrame where the Date value is less than max_date or Date is null (None).

How can I do this?

I tried the 3 options below, but they all failed.

df.filter(df['Date'] < max_date or df['Date'] == None).createOrReplaceTempView("Final_dataset")

final_df = df.filter(df['Date'] != max_date | df['Date'] is None)

final_df = df.filter(df['Date'] != max_date or df['Date'] is None)

1 Answer
final_df = df.filter((df.Date < max_date) | (df.Date.isNull()))

Regular logical Python operators (and, or, not) don't work on PySpark Column conditions; you need to use the bitwise operators (&, |, ~) instead. They can also be a bit tricky because of operator precedence, so you usually need extra parentheses around each condition to disambiguate the expression.

Have a look here: Boolean operators vs Bitwise operators
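
For context, here is a minimal, self-contained sketch of the pattern above; the sample rows, the id column, and the max_date value are made up for illustration and only mirror the names used in the question:

from datetime import date

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: the "Date" column contains one null row.
df = spark.createDataFrame(
    [(1, date(2020, 1, 1)), (2, date(2021, 6, 1)), (3, None)],
    ["id", "Date"],
)
max_date = date(2021, 1, 1)

# Bitwise | combines the two Column conditions; each side is wrapped in
# parentheses because | binds more tightly than the comparison operators.
final_df = df.filter((df.Date < max_date) | (df.Date.isNull()))
final_df.show()
# Keeps row 1 (Date before max_date) and row 3 (Date is null).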
