pyspark dataframe when and multiple otherwise clause

Question

I'll need to create an if multiple else in a pyspark dataframe.

I have two columns to be logically tested.

Logic is below:

If Column A OR Column B contains "something", then write "X"
Else If (Numeric Value in a string of Column A + Numeric Value in a string of Column B) > 100 , then write "X"
Else If (Numeric Value in a string of Column A + Numeric Value in a string of Column B) > 50 , then write "Y"
Else If (Numeric Value in a string of Column A + Numeric Value in a string of Column B) > 0 , then write "Z"
Else, then write "T"

to a new column "RESULT"

I thought the quickest search method is when, otherwise, otherwise, otherwise, otherwise and failed in the query below.

I'd be appreciated if you suggest a quicker method for that.

Note: when(clause).when(clause).when(clause).when(clause).otherwise(clause) searches the whole table again and again. I want to proceed with unmatched data only.

df = df.withColumn('RESULT', F.when(\
F.when((F.col("A").like("%something%") | F.col("B").like("%something%")), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 100), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 29), "Y").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 0), "Z").otherwise(\
"T"))))))

Desired Output

don't do `when(condition1, ...).otherwise(when(condition2, ...).otherwise(when(condition3, ...)))`. The proper way is to do `.when(condition1, ...).when(condition2, ...).when(condition3, ...).otherwise()` — pault, Mar 25 '19 at 20:10
Possible duplicate of [Spark Equivalent of IF Then ELSE](https://stackoverflow.com/questions/39048229/spark-equivalent-of-if-then-else) — pault, Mar 25 '19 at 20:10
@pault when().when().when().otherwise() searches the whole table all over again and again. I want to process remaining unmatched data each time. — korayckk, Mar 25 '19 at 20:12
*when().when().when().otherwise() searches the whole table all over again and again* - this claim is incorrect. It is the equivalent of if-elif-elif-else (show me the documentation if you think I'm wrong) — pault, Mar 25 '19 at 20:13
@pault I could not find any document rather says it's slower or faster but I tested few queries before and saw a slightly significant difference. — korayckk, Mar 25 '19 at 20:17
Then you'll have to add in a [mcve] that demonstrates that difference. It's more likely that you had a different error in your code. — pault, Mar 25 '19 at 20:18

score 0 · Answer 1 · answered Mar 25 '19 at 20:41

0

I got the solution anyway.

df = df.withColumn('RESULT',\
F.when((F.col("A").like("%something%") | F.col("B").like("%something%")), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 100), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 29), "Y").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 0), "Z").otherwise(\
"T")))))

answered Mar 25 '19 at 20:41

korayckk

36
1
6

you can prove to yourself that this is equivalent to `when().when().when().otherwise()` by comparing the execution plans for both methods (`df.explain()`). You will see that they are identical. – pault Mar 25 '19 at 20:50
@pault thanks for the guidance; I'll try when().when().when().otherwise() performance and I'll let you know the result whether they perform identical or not. – korayckk Mar 25 '19 at 21:13

pyspark dataframe when and multiple otherwise clause

1 Answers1