I have two tables in Apache Spark 2.x. Each table has a common row "IDNUM". Call them table A and table B.
This is fast in Apache SparkSQL:
SELECT COUNT(*) FROM A where IDNUM IN (SELECT IDNUM FROM B);
This is incredibly slow:
SELECT COUNT(*) FROM A where IDNUM NOT IN (SELECT IDNUM FROM B);
It is so much slower that it's faster just to complete:
total = SELECT COUNT(*) FROM A;
quan = SELECT COUNT(*) FROM A where IDNUM IN (SELECT IDNUM FROM B);
print(total-quan)
But I can't figure out why the NOT IN
should be so slow.