Query performance WHERE clause contains IN (subquery)

Question

SELECT Trade.TradeId, Trade.Type, Trade.Symbol, Trade.TradeDate, 
       SUM(TradeLine.Notional) / 1000 AS Expr1
FROM   Trade INNER JOIN
             TradeLine ON Trade.TradeId = TradeLine.TradeId
WHERE  (TradeLine.Id IN
                      (SELECT     PairOffId
                        FROM          TradeLine AS TradeLine_1
                        WHERE      (TradeDate <= '2011-05-11')
                        GROUP BY PairOffId
                        HAVING      (SUM(Notional) <> 0)))
GROUP BY Trade.TradeId, Trade.Type, Trade.Symbol, Trade.TradeDate
ORDER BY Trade.Type, Trade.TradeDate

I am concerned about the performance of the IN in the WHERE clause when the table starts to grow. Does anyone have a better strategy for this kind of query? The number of records returned by the subquery grows much slower than the number of records in the TradeLine table. The TradeLine table itself grows at a rate of 10/day.

Thank you.

EDIT: I used the idea of moving the subquery from WHERE to FROM. I voted up on all answers that contributed to this new query.

   SELECT Trade.TradeId, Trade.Type, Trade.Symbol, Trade.TradeDate,   
          PairOff.Notional / 1000 AS Expr1
   FROM         Trade INNER JOIN
                  TradeLine ON Trade.TradeId = TradeLine.TradeId INNER JOIN
                      (SELECT     PairOffId, SUM(Notional) AS Notional
                        FROM          TradeLine AS TradeLine_1
                        WHERE      (TradeDate <= '2011-05-11')
                        GROUP BY PairOffId
                   HAVING (SUM(Notional) <> 0)) AS PairOff ON TradeLine.Id = PairOff.PairOffId
   ORDER BY Trade.Type, Trade.TradeDate

Can't you reverse the order? Run the subquery as the outer query, then the outer query as the subquery. — soandos, May 11 '11 at 20:43
Yes, though not a lot. I was looking at is as an if(a) if(b) where the outer loop takes more time than the inner. so reversing them gives the same result, but faster — soandos, May 11 '11 at 20:53
@Soandos - it's not like that. The outer query is limited based on the results of the inner query. If you reverse them you will not get an identical result set. — JNK, May 11 '11 at 20:56
Why not? Effectively, I am testing for two conditions. I am just changing the order in which I am checking. I am not excluding any valid data... — soandos, May 11 '11 at 20:57
Not analogus. its the Intersection of a and b = Intersection of b and a, and that is true — soandos, May 11 '11 at 21:00
The only reasonable answer to this question is to set up test data and look at the execution plans. You may well find that `in` performs better than your rewrite! — Martin Smith, May 12 '11 at 12:37

score 6 · Accepted Answer · answered May 11 '11 at 20:54

The subquery in the IN clause does not depend on anything in the outer query. You can safely move it into FROM clause; a sane query plan builder would do it automatically.

Also, calling EXPLAIN PLAN on any query you're going to use in production is a must. Do it and see what the DBMS thinks of the plan for this query.

score 2 · Answer 2 · answered May 11 '11 at 20:49

I'm a fan of temp tables when a sub-query starts returning too large a result set.

So your where clause would just be

Where TradeLine.Id In (Select PairOffId From #tempResults)

and #tempResults would be defined as (warning: syntax is from memory, which means there may be errors)

Select PairOffId Into #tempResults
From TradeLine
Where (TradeDate <= @TradeDate) 
  //I prefer params in case the query becomes a StoredProc
Group By PairOffId
Having (Sum(Notional) <> 0)

score 1 · Answer 3 · edited Mar 06 '16 at 22:03

I have faced same issue with hundreds of thousands of records in XXXXXX DB. In my code i want to retrieve hierarchy (node which contain at least one child) nodes from all nodes.

The initial query written which is very slow.

  SELECT SUPPLIER_ID, PARENT_SUPPLIER_ID,
  FROM SUPPLIER
  WHERE 
    SUPPLIER_ID != PARENT_SUPPLIER_ID
    OR 
    SUPPLIER_ID   IN
      (SELECT DISTINCT PARENT_SUPPLIER_ID
       FROM SUPPLIER
       WHERE SUPPLIER_ID != PARENT_SUPPLIER_ID
      );

Then re-written to

  SELECT a.SUPPLIER_ID, a.PARENT_SUPPLIER_ID,
  FROM SUPPLIER a
  LEFT JOIN
  (SELECT DISTINCT PARENT_SUPPLIER_ID
  FROM SUPPLIER
  WHERE SUPPLIER_ID != PARENT_SUPPLIER_ID
  ) b
  ON a. SUPPLIER_ID     = b.PARENT_SUPPLIER_ID
  WHERE a. SUPPLIER_ID != a.PARENT_SUPPLIER_ID
     OR a. SUPPLIER_ID     = b.PARENT_SUPPLIER_ID;

score 1 · Answer 4 · edited Mar 06 '16 at 22:02

I have 2 suggestion you can try:

1). use Exists since you don't need get data from subquery, like this:

where exists ( select 1 from TradeLine AS TradeLine_1 where TradeLine.Id = TradeLine_1.PairOffId -- continue with your subquery ... )

2). main query join to your subquery, for instance

... join ( your_subquery) on your_subquery.PairOffId = TradeLine.Id

I believe these 2 ways can achieve better performance than "In" operation.

score -1 · Answer 5 · answered May 11 '11 at 20:54

-1

Using an IN will essentially force you to do a table scan. When your table grows, your execution time grows. Also you are running that query for each record returned. Would be easier to use a scalar select as a table:

SELECT t.TradeId, t.Type, t.Symbol, t.TradeDate, 
       SUM(TradeLine.Notional) / 1000 AS Expr1
FROM   Trade t,
(SELECT     TradeId, PairOffID
                        FROM          TradeLine AS TradeLine_1
                        WHERE      (TradeDate <= '2011-05-11')
                        GROUP BY PairOffId
                        HAVING      (SUM(Notional) <> 0)) tl       
WHERE  t.TradeId = tl.TradeId
  and  t.id <> tl.PairOffID
GROUP BY Trade.TradeId, Trade.Type, Trade.Symbol, Trade.TradeDate
ORDER BY Trade.Type, Trade.TradeDate

answered May 11 '11 at 20:54

cmutt78

861
1
9
18

3

-1 for `Using an IN will essentially force you to do a table scan`. This is not at all accurate. – JNK May 11 '11 at 20:57
1

`IN` is just another type of Join (a semi join). In `MySQL` it is indeed implemented badly and [repeatedly re-evaluated as a correlated sub query row by row even when uncorrelated](http://stackoverflow.com/questions/3417074/why-would-an-in-condition-be-slower-than-in-sql/3417190#3417190) but not in SQL Server. Your proposed rewrite changes the semantics also. – Martin Smith May 11 '11 at 21:02
There is no `t.id` in the schema?! – Andrew Lazarus May 11 '11 at 21:10

Query performance WHERE clause contains IN (subquery)

5 Answers5

Linked