1

Is there a difference between left-anti join and except in Spark in my implementation below?

Except when both DFs have 3 cols.

scala> val someDF5 = Seq(
     |     ("202003101750", "202003101700", 122),
     |     ("202003101800", "202003101700", 12),
     |     ("202003101750", "202003101700", 42),
     |     ("202003101810", "202003101700", 2),
     |       ("202003101810", "22222222", 222)
     |   ).toDF("number", "word", "value")
someDF5: org.apache.spark.sql.DataFrame = [number: string, word: string ... 1 more field]
scala> val someDF = Seq(
     |   ("202003101750", "202003101700",122),
     |   ("202003101800", "202003101700",12),
     |   ("202003101750", "202003101700",42)
     | ).toDF("number", "word","value")
someDF: org.apache.spark.sql.DataFrame = [number: string, word: string ... 1 more field]
scala> someDF5.except(someDF).show()
+------------+------------+-----+                                               
|      number|        word|value|
+------------+------------+-----+
|202003101810|202003101700|    2|
|202003101810|    22222222|  222|
+------------+------------+-----+

Left-Anti when 1 DF has 2 cols and another one has 3 cols

scala> val someDF4 = someDF.drop("value")
someDF4: org.apache.spark.sql.DataFrame = [number: string, word: string]
scala> someDF5.join(someDF4, Seq("number","word"), "left_anti").orderBy($"number".desc).show()
+------------+------------+-----+
|      number|        word|value|
+------------+------------+-----+
|202003101810|202003101700|    2|
|202003101810|    22222222|  222|
+------------+------------+-----+

The outputs match, and with left-anti I dont need the same number of cols in both tables. So, am I actually getting the same output using both except and left-anti join?

coderWorld
  • 602
  • 1
  • 8
  • 28

1 Answers1

3

No, It will not give you same result, It depends on the data in both DataFrames & columns you are using in left_anit join.

To show this I have modified your sample data.

except needs both DataFrame should have same number of columns, It will compare column to column in both DataFrame.

left_anti Both DataFrame can have multiple number of columns except joining columns. It will only compare joining columns.

Performance wise left_anti is faster than except Took your sample data to execute

  1. except took 316 ms to process & display data.
  2. left_anti took 60 ms to process & display data.
scala> val someDF5 = Seq(
     |     ("202003101750", "202003101700", 122),
     |     ("202003101800", "202003101700", 12),
     |     ("202003101750", "202003101700", 42),
     |     ("202003101810", "202003101700", 2),
     |     ("202003101810", "22222222", 222),
     |     ("302003101810", "33333333", 10)
     |
     | ).toDF("number", "word", "value")
someDF5: org.apache.spark.sql.DataFrame = [number: string, word: string ... 1 more field]

scala> val someDF = Seq(
     |     ("202003101750", "202003101700",122),
     |     ("202003101800", "202003101700",12),
     |     ("202003101750", "202003101700",42),
     |     ("302003101810", "33333333", 500)
     |     ).toDF("number", "word","value")
someDF: org.apache.spark.sql.DataFrame = [number: string, word: string ... 1 more field]

scala> //Using except

scala> spark.time{
     |     someDF5.except(someDF).show()
     | }
+------------+------------+-----+
|      number|        word|value|
+------------+------------+-----+
|302003101810|    33333333|   10|
|202003101810|202003101700|    2|
|202003101810|    22222222|  222|
+------------+------------+-----+

Time taken: 327 ms

scala> //Using left_anti join.

scala> spark.time{
     |     val someDF4 = someDF.drop("value")
     |     someDF5.join(someDF4, Seq("number","word"), "left_anti").orderBy($"number".desc).show()
     | }
+------------+------------+-----+
|      number|        word|value|
+------------+------------+-----+
|202003101810|202003101700|    2|
|202003101810|    22222222|  222|
+------------+------------+-----+

Time taken: 61 ms

// Added all columns in left_anti join, Now it will give same result as except.

scala> spark.time{
     |     someDF5.join(someDF, Seq("number","word","value"), "left_anti").orderBy($"number".desc).show()
     | }
+------------+------------+-----+
|      number|        word|value|
+------------+------------+-----+
|302003101810|    33333333|   10|
|202003101810|202003101700|    2|
|202003101810|    22222222|  222|
+------------+------------+-----+

Time taken: 74 ms
Srinivas
  • 8,957
  • 2
  • 12
  • 26