Spark: why is a subselect with "NOT IN" dramatically slower than "IN"?

Question

I have two tables in Apache Spark 2.x. Each table has a common row "IDNUM". Call them table A and table B.

This is fast in Apache SparkSQL:

SELECT COUNT(*) FROM A where IDNUM IN (SELECT IDNUM FROM B);

This is incredibly slow:

SELECT COUNT(*) FROM A where IDNUM NOT IN (SELECT IDNUM FROM B);

It is so much slower that it's faster just to complete:

total = SELECT COUNT(*) FROM A;
quan  = SELECT COUNT(*) FROM A where IDNUM IN (SELECT IDNUM FROM B);
print(total-quan)

But I can't figure out why the NOT IN should be so slow.

Do you have some sample benchmarks on how much slower it is? 2x, 10x? — Raj, Jan 25 '19 at 23:54
Please [edit] your question and provide a [mcve] - [How to make god reproducible Apache Spark examples](https://stackoverflow.com/q/48427185) provides a number of useful hints how to do it, including, but not limited to, providing the execution plan. In this specific case I'd also recommend explaining how data is loaded, how you benchmark, and including relevant configuration (at least configuration related to adaptive and cost based execution). — zero323, Jan 26 '19 at 10:39
I already noticed a few times that spark sometimes where really slow at performing NOT IN statement, while writing the equivalent query with a JOIN B ON A.IDNUM = B.IDNUM WHERE B.IDNUM is null could be much, much faster. You should try this to see if it is any faster for you too. — FurryMachine, Jan 27 '19 at 13:01

score 1 · Answer 1 · answered Jan 27 '19 at 01:32

I don't know about Apache Spark but it looks to me that the database has fast access to row/items in A and B by IDNUM, so a subselect or join that checks for equality can be processed very fast if the number of elements that pass the condition 'IDNUM IN (SELECT IDNUM FROM B)' is low. Getting the count of an entire table is a common case so it would be optimized to be fast.

'IDNUM NOT IN (SELECT IDNUM FROM B)' naively would require scanning the IDNUM column for all rows in A to check that they aren't in the set from B. As you are only doing a count you don't really need the IDNUM values from such rows but I guess the query optimizer isn't smart enough to transform the query into the difference between two counts.

Spark: why is a subselect with "NOT IN" dramatically slower than "IN"?

1 Answers1