8

I was wondering if there is a performance difference between calling except (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#except(org.apache.spark.sql.Dataset)) and using a left anti-join. So far, the only difference I can see is that with the left anti-join, the two datasets can have different columns.

alexgbelov

1 Answer

5

Your title and your explanation differ.

But if both datasets have the same structure, you can use either method to find missing data.

EXCEPT is a specific implementation that enforces the same structure on both sides and is a subtract operation, whereas LEFT ANTI JOIN allows different structures, as you say, but can give the same result.
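To make the semantics concrete, here is a minimal sketch in plain Python. It is not Spark code, only a model of the two operations on lists of tuples, and the function names (except_distinct, left_anti_join) and the sample data are hypothetical. It assumes what the linked Dataset API doc states, namely that except is equivalent to EXCEPT DISTINCT in SQL, so its result is de-duplicated, whereas a left anti join keeps every unmatched left row, duplicates included.

```python
# Plain-Python model of the two operations; this is NOT Spark code.
# Rows are tuples; all names here are hypothetical, for illustration only.

def except_distinct(left, right):
    """Rows of `left` that do not appear in `right`, de-duplicated."""
    right_rows = set(right)
    seen = set()
    result = []
    for row in left:
        if row not in right_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result


def left_anti_join(left, right, left_key, right_key):
    """Rows of `left` whose key has no match in `right`; duplicates are kept."""
    right_keys = {right_key(row) for row in right}
    return [row for row in left if left_key(row) not in right_keys]


customers = [(1, "ann"), (2, "bob"), (2, "bob"), (3, "cy")]
known = [(1, "ann")]   # same structure as customers
orders = [(100, 1)]    # (order_id, customer_id): a different structure

# Same structure: both work, but only the except model de-duplicates.
print(except_distinct(customers, known))
print(left_anti_join(customers, known, lambda r: r, lambda r: r))

# Different structure: only the anti join applies (customers with no orders).
print(left_anti_join(customers, orders, lambda c: c[0], lambda o: o[1]))
```

So "can give the same result" holds when the left side has no duplicate rows; with duplicates, you would need a distinct after the anti join to match except.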

Use cases differ: 1) Left Anti Join applies to many situations involving missing data, such as customers with no orders (yet) or orphans in a database. 2) Except is for subtracting things, e.g. splitting data into test and training sets in Machine Learning.

Performance should not be a real deal breaker, as the two generally serve different use cases and are therefore difficult to compare. Except will typically involve the same data source, whereas a left anti join will typically involve different data sources.

thebluephantom