0

T1 contains about 30 million rows, T2 contains about 100k rows

select a from T2 gives ('a1','a2','a3',...); (1 lakh rows)

When I use 100k constant values directly inside the in block the query returns result in 80 millisec. But, when I use nested select in the query, it takes like forever.

select a,b from T1 where a in ('a1','a2','a3', ...); (Constant Values inside in block)  
select a,b from T1 where a in (select a from T2); (Query instead of values)

Any Idea why is it happening? Also is there a better way to do so? Since T1 contains 30 million rows, Left Join also takes a lot of time.

My Actual Query is :

select a,b from t1 where (a,b) not in (select a,b from t2) and a in (select a from t1); 
James Z
  • 12,209
  • 10
  • 24
  • 44
Talha
  • 79
  • 1
  • 7
  • 1
    do an `explain` of the query. you'll probably see mysql's creating a temp table/using filesort, which takes time. and note for this kind of thing, you'd be better of just using a `join`. `select a,b from t1 join t2 on t1.whatever=t2.whatever` – Marc B Sep 06 '16 at 20:47
  • Thanks, Also considering the fact t1 can have 3crore rows and t2 some lakh rows. My actual query is : select a,b from t1 where (a,b) not in (select a,b from t2) and a in (select a from t1); I need an optimised one. – Talha Sep 06 '16 at 20:51
  • Your title talks about `NOT IN`, yet your code says `IN`? Which is it? – Rick James Sep 06 '16 at 20:53
  • See http://stackoverflow.com/questions/21633115/return-row-only-if-value-doesnt-exist?lq=1 for how to do the `NOT IN (SELECT ...)` better. – Barmar Sep 06 '16 at 20:53
  • I have added my actual query which contains both NOT IN and IN – Talha Sep 06 '16 at 20:57

2 Answers2

0

There is a third, better, way:

SELECT ...
    FROM T1
    JOIN T2  ON T1.a = T2.a;

(And be sure that there is an index on a in each table.)

IN ( SELECT ... ) is notoriously slow; avoid it.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • A simple inner join isn't a good substitute for `IN ( SELECT ... )` because the former won't do a semi-join properly. – Bill Karwin Sep 06 '16 at 21:00
0

The subquery is more expensive in MySQL because MySQL hasn't optimized that type of query very well. MySQL doesn't notice that the subquery is invarant, in other words the subquery on T2 has the same result regardless of which row of T1 is being searched. So you and I can see clearly that MySQL should execute the subquery once and use its result to evaluate each row of T1.

But in fact, MySQL naively assumes that the subquery is a correlated subquery that may have a different result for each row in T1. So it is forced to run the subquery many times.

You clarified in a comment that your query is actually:

select a,b from t1 
where (a,b) not in (select a,b from t2) 
    and a in (select a from t1); 

You should also know that MySQL does not optimize tuple comparison at all. Even if it should use an index, it will do a table-scan. I think they're working on fixing that in MySQL 8.

Your second term is unnecessary, because the subquery is selecting from t1, so obviously any value of a in t1 exists in t1. Did you mean to put t2 in the subquery? I'll assume that you just made an error typing.

Here's how I would write your query:

select a, b from t1
left outer join (select distinct a, b from t2) as nt2 
    on t1.a = nt2.a and t1.b = nt2.b
where nt2.a is null;

In these cases, MySQL treats the subquery differently because it appears in the FROM clause. It runs each subquery once and stores the results in temp tables. Then it evaluates the rows of t1 against the data in the temp tables.

The use of distinct is to make sure to do the semi-join properly; if there are multiple matching rows in t2, we don't want multiple rows of output from the query.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • Thanks, a lot. Yes the second term is not required, the query should be only the following : select a,b from t1 where (a,b) not in (select a,b from t2) – Talha Sep 06 '16 at 21:12
  • It is giving error : Column 'a' in field list is ambiguous – Talha Sep 06 '16 at 21:20
  • Qualify column names like `t1.a` to remove ambiguity. – Bill Karwin Sep 06 '16 at 21:21
  • Yes, it worked. But the query is still taking a lot of time. When I directly put the data from (select a,b from t2) i.e. the second part it runs very quickly. – Talha Sep 06 '16 at 21:26
  • Yes, subqueries are pretty costly in MySQL. Perhaps you should do two queries: the first to query `select a, b from t2` and the second to use the result as literal values. – Bill Karwin Sep 06 '16 at 21:34