49

I have seen this question asked here before and have taken lessons from it. However, I am not sure why I am getting an error when I feel it should work.

I want to create a new column in an existing Spark DataFrame according to some rules. Here is what I wrote. iris_spark is the data frame, with a categorical variable iris_class that has three distinct categories.

from pyspark.sql import functions as F

iris_spark_df = iris_spark.withColumn(
    "Class", 
   F.when(iris_spark.iris_class == 'Iris-setosa', 0, F.when(iris_spark.iris_class == 'Iris-versicolor',1)).otherwise(2))

It throws the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn("Class",F.when(iris_spark.iris_class=='Iris-setosa',0,F.when(iris_spark.iris_class=='Iris-versicolor',1)))

TypeError: when() takes exactly 2 arguments (3 given)


Any idea why?

Baktaawar

4 Answers

77

The correct structure is either:

(when(col("iris_class") == 'Iris-setosa', 0)
.when(col("iris_class") == 'Iris-versicolor', 1)
.otherwise(2))

which is equivalent to

CASE 
    WHEN (iris_class = 'Iris-setosa') THEN 0
    WHEN (iris_class = 'Iris-versicolor') THEN 1 
    ELSE 2
END

or:

(when(col("iris_class") == 'Iris-setosa', 0)
    .otherwise(when(col("iris_class") == 'Iris-versicolor', 1)
        .otherwise(2)))

which is equivalent to:

CASE WHEN (iris_class = 'Iris-setosa') THEN 0 
     ELSE CASE WHEN (iris_class = 'Iris-versicolor') THEN 1 
               ELSE 2 
          END 
END

with general syntax:

when(condition, value).when(...)

or

when(condition, value).otherwise(...)

You probably mixed this up with the Hive IF conditional:

IF(condition, if-true, if-false)

which can be used only in raw SQL with Hive support.

zero323
  • 1
    Adding slightly more context: you'll need `from pyspark.sql.functions import when` for this. – Sarah Messer Jul 06 '20 at 20:09
  • 1
    When you chain multiple `when` without `otherwise` in between, note that when multiple `when` cases are true, only the first true `when` will be evaluated. – Safwan Sep 26 '20 at 09:59
22

Conditional statements in Spark

  • Using “when otherwise” on DataFrame
  • Using “case when” on DataFrame
  • Using && and || operator

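import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[1]").appName("SparkByExamples.com").getOrCreate()
// Import implicits only after `spark` exists; they provide toDF on local collections.
import spark.implicits._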

val data = List(("James ","","Smith","36636","M",60000),
        ("Michael ","Rose","","40288","M",70000),
        ("Robert ","","Williams","42114","",400000),
        ("Maria ","Anne","Jones","39192","F",500000),
        ("Jen","Mary","Brown","","F",0))

val cols = Seq("first_name","middle_name","last_name","dob","gender","salary")
val df = spark.createDataFrame(data).toDF(cols:_*)

1. Using “when otherwise” on DataFrame

Replace the value of gender with a new value:

val df1 = df.withColumn("new_gender", when(col("gender") === "M","Male")
      .when(col("gender") === "F","Female")
      .otherwise("Unknown"))

val df2 = df.select(col("*"), when(col("gender") === "M","Male")
      .when(col("gender") === "F","Female")
      .otherwise("Unknown").alias("new_gender"))

2. Using “case when” on DataFrame

val df3 = df.withColumn("new_gender",
  expr("case when gender = 'M' then 'Male' " +
                   "when gender = 'F' then 'Female' " +
                   "else 'Unknown' end"))

Alternatively,

val df4 = df.select(col("*"),
      expr("case when gender = 'M' then 'Male' " +
                       "when gender = 'F' then 'Female' " +
                       "else 'Unknown' end").alias("new_gender"))

3. Using && and || operator

val dataDF = Seq(
      (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
      ).toDF("id", "code", "amt")
dataDF.withColumn("new_column",
       when(col("code") === "a" || col("code") === "d", "A")
      .when(col("code") === "b" && col("amt") === "4", "B")
      .otherwise("A1"))
      .show()

Output:

+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66|   a|  4|         A|
| 67|   a|  0|         A|
| 70|   b|  4|         B|
| 71|   d|  4|         A|
+---+----+---+----------+
vj sreenivasan
  • 1
    The answer is very nicely detailed, but OP's tags & question are clearly Python-focused and this answer is done entirely in Scala. Answer could be improved further by noting Python syntax which is often but not always very similar to the Scala equivalent. – Sarah Messer Jul 06 '20 at 20:08
  • In PySpark, the && and || operators do not exist, and using them throws a SyntaxError. For a better understanding, refer to this link: https://stackoverflow.com/questions/37707305/pyspark-multiple-conditions-in-when-clause#:~:text=You%20get%20SyntaxError%20error%20exception,doesn't%20consider%20operator%20precedence. – Pramod Kumar Sharma Apr 27 '21 at 20:18
7

There are different ways you can achieve if-then-else.

  1. Using the when function in the DataFrame API. You can specify a list of conditions with when and use otherwise to specify the fallback value. You can also nest this expression.

  2. Using the expr function. You can pass a SQL expression to expr. In the example below, we create a new column "quarter" based on the month column.

cond = """case when month > 9 then 'Q4'
            else case when month > 6 then 'Q3'
                else case when month > 3 then 'Q2'
                    else case when month > 0 then 'Q1'
                        end
                    end
                end
            end as quarter"""

newdf = df.withColumn("quarter", expr(cond))
  3. Using the selectExpr function. This variant of select accepts SQL expressions. See the example below.
    cond = """case when month > 9 then 'Q4'
                else case when month > 6 then 'Q3'
                    else case when month > 3 then 'Q2'
                        else case when month > 0 then 'Q1'
                            end
                        end
                    end
                end as quarter"""

    newdf = df.selectExpr("*", cond)

user3190018
Neeraj Bhadani
3

You can use if(exp1, exp2, exp3) inside spark.sql(), where exp1 is the condition: if it is true the result is exp2, otherwise it is exp3.

The funny thing with nested if-else is that you need to wrap every expression in parentheses "()", otherwise it will raise an error.

Example:

if((1>2), (if((2>3), True, False)), (False))
vikrant rana
vermaji