
I wanted to evaluate two conditions in `when`, like this:

import pyspark.sql.functions as F

df = df.withColumn(
    'trueVal', F.when(df.value < 1 OR df.value2  == 'false' , 0 ).otherwise(df.value)) 

For this I get `invalid syntax` because of the `OR`.

I even tried nested `when` statements:

df = df.withColumn(
    'v', 
    F.when(df.value < 1,(F.when( df.value =1,0).otherwise(df.value))).otherwise(df.value)
) 

For this I get `keyword can't be an expression` for the nested `when` statements.

How can I use multiple conditions in `when`? Is there any workaround?

pault
Kiran Bhagwat
    This question is a bit old, but your `'keyword can't be an expression'` error is actually a result of using a single `=` rather than `==` in the inner `when`. – PMende May 07 '20 at 00:39

1 Answer


pyspark.sql.functions.when takes a Boolean Column as its condition. When using PySpark, it's often useful to think "Column Expression" when you read "Column".

Logical operations on PySpark columns use the bitwise operators:

  • & for and
  • | for or
  • ~ for not

When combining these with comparison operators such as <, parentheses are often needed.
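The reason parentheses are needed is plain Python operator precedence: `|`, `&`, and `~` bind more tightly than `<` and `==`, so without parentheses the comparison does not group the way you'd expect. A small sketch using only the standard-library `ast` module (no Spark needed) makes the parse visible; the names `a` and `b` stand in for the columns:

```python
import ast

# In Python, | binds more tightly than < or ==, so without parentheses
# `df.value < 1 | df.value2 == 'false'` is parsed as a single chained
# comparison over the sub-expression `1 | df.value2`, not as
# "(value < 1) or (value2 == 'false')".
tree = ast.parse("a < 1 | b == 'false'", mode="eval")

# The whole expression is one Compare node with two chained operators;
# its first right-hand operand is the BinOp `1 | b`, showing that | was
# folded inside the comparison.
assert isinstance(tree.body, ast.Compare)
assert len(tree.body.ops) == 2
assert isinstance(tree.body.comparators[0], ast.BinOp)
```

With PySpark Columns, that mis-grouping surfaces as an error at runtime or parse time, which is why each comparison gets its own parentheses before `|` or `&` is applied.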

In your case, the correct statement is:

import pyspark.sql.functions as F
df = df.withColumn('trueVal',
    F.when((df.value < 1) | (df.value2 == 'false'), 0).otherwise(df.value))

See also: SPARK-8568

blurry
Daniel Shields