1

I'll need to create an if multiple else in a pyspark dataframe.

I have two columns to be logically tested.

Logic is below:

  1. If Column A OR Column B contains "something", then write "X"
  2. Else If (Numeric Value in a string of Column A + Numeric Value in a string of Column B) > 100 , then write "X"
  3. Else If (Numeric Value in a string of Column A + Numeric Value in a string of Column B) > 50 , then write "Y"
  4. Else If (Numeric Value in a string of Column A + Numeric Value in a string of Column B) > 0 , then write "Z"
  5. Else, then write "T"

to a new column "RESULT"

I thought the quickest search method is when, otherwise, otherwise, otherwise, otherwise and failed in the query below.

I'd be appreciated if you suggest a quicker method for that.

Note: when(clause).when(clause).when(clause).when(clause).otherwise(clause) searches the whole table again and again. I want to proceed with unmatched data only.

df = df.withColumn('RESULT', F.when(\
F.when((F.col("A").like("%something%") | F.col("B").like("%something%")), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 100), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 29), "Y").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 0), "Z").otherwise(\
"T"))))))

Desired Output

korayckk
  • 36
  • 1
  • 6
  • don't do `when(condition1, ...).otherwise(when(condition2, ...).otherwise(when(condition3, ...)))`. The proper way is to do `.when(condition1, ...).when(condition2, ...).when(condition3, ...).otherwise()` – pault Mar 25 '19 at 20:10
  • Possible duplicate of [Spark Equivalent of IF Then ELSE](https://stackoverflow.com/questions/39048229/spark-equivalent-of-if-then-else) – pault Mar 25 '19 at 20:10
  • @pault when().when().when().otherwise() searches the whole table all over again and again. I want to process remaining unmatched data each time. – korayckk Mar 25 '19 at 20:12
  • *when().when().when().otherwise() searches the whole table all over again and again* - this claim is incorrect. It is the equivalent of if-elif-elif-else (show me the documentation if you think I'm wrong) – pault Mar 25 '19 at 20:13
  • @pault I could not find any document rather says it's slower or faster but I tested few queries before and saw a slightly significant difference. – korayckk Mar 25 '19 at 20:17
  • Then you'll have to add in a [mcve] that demonstrates that difference. It's more likely that you had a different error in your code. – pault Mar 25 '19 at 20:18

1 Answers1

0

I got the solution anyway.

df = df.withColumn('RESULT',\
F.when((F.col("A").like("%something%") | F.col("B").like("%something%")), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 100), "X").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 29), "Y").otherwise(\
F.when((((F.regexp_extract(F.col("A"), ".(\d+).", 1)) + F.regexp_extract(F.col("B"), ".(\d+).", 1)) > 0), "Z").otherwise(\
"T")))))
korayckk
  • 36
  • 1
  • 6
  • you can prove to yourself that this is equivalent to `when().when().when().otherwise()` by comparing the execution plans for both methods (`df.explain()`). You will see that they are identical. – pault Mar 25 '19 at 20:50
  • @pault thanks for the guidance; I'll try when().when().when().otherwise() performance and I'll let you know the result whether they perform identical or not. – korayckk Mar 25 '19 at 21:13