
While defining multiple logical/relational conditions on a Spark DataFrame I get the error below in Scala, but the same thing works fine in Python (PySpark).

Python code:

df2=df1.where(((col('a')==col('b')) & (abs(col('c')) <= 1))
 | ((col('a')==col('fin')) & ((col('b') <= 3) & (col('c') > 1)) & (col('d') <= 500))
 | ((col('a')==col('b')) & ((col('c') <= 15) & (col('c') > 3)) & (col('d') <= 200))
 | ((col('a')==col('b')) & ((col('c') <= 30) & (col('c') > 15)) & (col('c') <= 100)))

Tried the Scala equivalent:

val df_aqua_xentry_dtb_match=df_aqua_xentry.where((col("a") eq col("b")) &  (abs(col("c") ) <= 1))
 notebook:2: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Boolean
val df_aqua_xentry_dtb_match=df_aqua_xentry.where((col("a") eq col("b")) &  (abs(col("c") ) <= 1))

How do I define multiple logical conditions on a Spark DataFrame using Scala?

Aksen P
pankajs
4 Answers


Please see the solution below.

df.where("StudentId == 1").explain(true)

== Parsed Logical Plan ==
'Filter ('StudentId = 1)
+- Project [_1#3 AS StudentId#7, _2#4 AS SubjectName#8, _3#5 AS Marks#9]
   +- LocalRelation [_1#3, _2#4, _3#5]

== Analyzed Logical Plan ==
StudentId: int, SubjectName: string, Marks: int
Filter (StudentId#7 = 1)
+- Project [_1#3 AS StudentId#7, _2#4 AS SubjectName#8, _3#5 AS Marks#9]
   +- LocalRelation [_1#3, _2#4, _3#5]

== Optimized Logical Plan ==
LocalRelation [StudentId#7, SubjectName#8, Marks#9]

Here we used a where clause; internally the optimizer converts it to a filter operation, even though the code uses where.

So we can apply a filter function on the rows of the DataFrame like below:

df.filter(row => row.getString(1) == "A" && row.getInt(0) == 1).show()

Here 0 and 1 are column indices of the DataFrame rows. In my case the schema is (StudentId(Int), SubjectName(String), Marks(Int)).
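This row-level filter behaves like filtering an ordinary Scala collection with a predicate. As a Spark-free analogy (the Student case class here is a made-up stand-in for the schema above):

```scala
// Spark-free analogy: filtering rows of (StudentId, SubjectName, Marks)
// with the same predicate used in the DataFrame filter above.
final case class Student(studentId: Int, subjectName: String, marks: Int)

val rows = Seq(Student(1, "A", 90), Student(2, "B", 80))

// Equivalent of: row.getString(1) == "A" && row.getInt(0) == 1
val matched = rows.filter(r => r.subjectName == "A" && r.studentId == 1)
// matched keeps only Student(1, "A", 90)
```

Note that a lambda-style filter like this runs as opaque Scala code per row, so unlike Column expressions it cannot be pushed down or optimized by Catalyst.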

Ravi

eq returns a Boolean, <= returns a Column. They are incompatible.

You probably want this:

df.where((col("a") === col("b")) && (abs(col("c") ) <= 1))

=== tests equality between Columns and returns a Column, so we can chain && to combine multiple conditions in the same where.
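Why does === return a Column? In Scala, === and && are just methods defined on Column, used infix to build an expression tree that Spark evaluates per row later. A toy, non-Spark sketch of the same pattern (Expr is a made-up stand-in, not Spark's actual implementation):

```scala
// Toy stand-in for Spark's Column: the symbolic operators are ordinary
// methods that build up an expression string instead of evaluating to a Boolean.
final case class Expr(repr: String) {
  def ===(other: Expr): Expr = Expr(s"(${repr} = ${other.repr})")
  def &&(other: Expr): Expr  = Expr(s"(${repr} AND ${other.repr})")
}

// Scala gives '===' higher precedence than '&&' (first-character rule),
// so this parses as (a === b) && c, just as it does with real Columns.
val e = Expr("a") === Expr("b") && Expr("c")
// e.repr == "((a = b) AND c)"
```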

BlueSheepToken

With Spark you should use

  • === instead of == or eq (see explanation)
  • && instead of & (&& is logical AND, & is binary AND)
val df_aqua_xentry_dtb_match = df_aqua_xentry.where((col("a") === col("b")) &&  (abs(col("c") ) <= 1))
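The type mismatch arises because (col("a") eq col("b")) evaluates to a plain Boolean, so & resolves to Boolean's own operator, which requires a Boolean on the right, not a Column. On plain Scala Booleans the two operators also differ in evaluation: && short-circuits while & always evaluates both operands. A small sketch (loud is a made-up helper):

```scala
// Demonstrates short-circuiting: && skips its right operand when the
// left side is false; & evaluates both sides unconditionally.
var evaluated = false
def loud(): Boolean = { evaluated = true; true }

val a = false && loud()  // loud() is never called
val skipped = !evaluated

val b = false & loud()   // loud() is called anyway
val forced = evaluated
// a == false, b == false, skipped == true, forced == true
```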
vinsce
  • Yeah, thanks for your recommendation. It's working now. val df2=df1.where(((col("a") ===col("b")) && (abs(col("c") ) <= 1)) || ((col("a")===col("b")) && ((col("c") <= 3) && (col("c") > 1)) && (col("d") <= 500)) || ((col("a")===col("b")) && ((col("c") <= 15) && (col("c") > 3)) && (col("d") <= 200)) || ((col("a")===col("b")) && ((col("c") <= 30) && (col("c") > 30)) && (col("d") <= 200))) – pankajs Aug 02 '19 at 09:29

There are a few issues with your Scala version of the code.

  1. "eq" is reference equality in Scala (it desugars to Java's ==): it checks whether two references point to the same object. String literals happen to compare true because they are interned, but comparing two Columns with "eq" returns a Boolean (almost always false, since each call builds a new Column object), not a Column. Here you can use the "===" operator for Column comparison.

String comparison

    scala> "praveen" eq "praveen"
    res54: Boolean = true 

    scala> "praveen" eq "nag"
    res55: Boolean = false   

    scala> lit(1) eq lit(2)
    res56: Boolean = false  

    scala> lit(1) eq lit(1)
    res57: Boolean = false
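The lit(...) results above are plain Scala reference equality at work: eq compares object identity, not value, so two freshly built objects are never eq even when they are ==-equal. A Spark-free sketch:

```scala
// eq is reference equality (Java's ==); == delegates to equals.
val s1 = new String("spark")  // force a fresh object (not the interned literal)
val s2 = new String("spark")

val sameValue    = s1 == s2             // true: same characters
val sameObject   = s1 eq s2             // false: distinct objects
val internedSame = "spark" eq "spark"   // true: string literals are interned
```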

Column comparison

    scala> lit(1) === lit(2)
    res58: org.apache.spark.sql.Column = (1 = 2)

    scala> lit(1) === lit(1)
    19/08/02 14:00:40 WARN Column: Constructing trivially true equals predicate, '1 = 1'. Perhaps you need to use aliases.
    res59: org.apache.spark.sql.Column = (1 = 1)
  2. You are using the bitwise AND operator "&" instead of the logical "and"/"&&" operator for the Column type. This is the reason you were getting the above error: Boolean's & expects another Boolean on the right-hand side, but found a Column.

      scala> df.show
      +---+---+
      | id|id1|
      +---+---+
      |  1|  2|
      +---+---+
    
      scala> df.where((col("id") === col("id1")) && (abs(col("id")) > 2)).show
      +---+---+
      | id|id1|
      +---+---+
      +---+---+
    
      scala> df.where((col("id") === col("id1")) and (abs(col("id")) > 2)).show
      +---+---+
      | id|id1|
      +---+---+
      +---+---+
    

Hope this helps!

Goldie