2

Suppose I have a table1 like this:

id | itemcode
-------------
1  | c1
2  | c2
...

And a table2 like this:

item | name
-----------
c1   | acme
c2   | foo
...

Would the following two queries return the same result set under every condition?

SELECT id, itemcode 
FROM table1 
WHERE itemcode IN (SELECT DISTINCT item 
                  FROM table2 
                  WHERE name [some arbitrary test])

SELECT id, itemcode 
FROM table1 
   JOIN (SELECT DISTINCT item 
          FROM table2
          WHERE name [some arbitrary test]) items 
         ON table1.itemcode = items.item

Unless I'm really missing something stupid, I'd say yes. But I've done two queries which boil down to this form and I am getting different results. There are some nested queries using WHERE IN, but for the last step I've noticed a JOIN is much faster. The nested queries are all entirely isolated so I don't believe they are the problem, so I just want to eliminate the possibility that I've got a misconception regarding the above.

Thanks for any insights.

EDIT

The two original queries:

SELECT imitm, imlitm, imglpt 
    FROM jdedata.F4101 
    WHERE imitm IN 
  (SELECT DISTINCT ivitm AS itemno 
       FROM jdedata.F4104 
       WHERE ivcitm IN 
    (SELECT DISTINCT ivcitm AS legacycode 
             FROM jdedata.F4104 
              WHERE ivitm IN 
      (SELECT DISTINCT tritm 
               FROM trigdata.F4101_TRIG)
    )
  )


SELECT orig.imitm, orig.imlitm, orig.imglpt 
    FROM jdedata.F4101 orig 
        JOIN 
        (SELECT DISTINCT ivitm AS itemno 
        FROM jdedata.F4104 
        WHERE ivcitm IN 
              (SELECT DISTINCT ivcitm AS legacycode 
              FROM jdedata.F4104
               WHERE ivitm IN 
                 (SELECT DISTINCT tritm 
                  FROM trigdata.F4101_TRIG))) itemns 
ON orig.imitm = itemns.itemno

EDIT 2

Although I still don't understand why the queries returned different results, it would seem our logic was flawed from the beginning since we were using the wrong columns in some parts. Mind that I'm not saying I made a mistake interpreting the queries as written above or had some typo, we just needed to select on some different stuff.

Normally I don't rest until I get to the bottom of things like these, but I'm very tired and am entering my first vacation since January that spans more than one day, so I can't really be bothered searching further right now. I'm sure the tips given here will come in handy later. Upvotes have been distributed for all the help and I've accepted Ypercube's answer, mostly because his comments have led me the furthest. But thanks all round! If I do find out more later, I'll try to remember pinging back in.

onedaywhen
  • 55,269
  • 12
  • 100
  • 138
G_H
  • 11,739
  • 3
  • 38
  • 82
  • What RDBMS are you using? `IN` with a sub query has known performance problems in MySQL as it [repeatedly re-evaluates the sub query](http://stackoverflow.com/questions/3417074/why-would-an-in-condition-be-slower-than-in-sql/3417190#3417190) but for SQL Server `IN/EXISTS` is an efficient way of doing a semi join. – Martin Smith Aug 24 '11 at 11:52
  • 1
    Your example queries should return the same results. – Jacob Aug 24 '11 at 11:53
  • It's an Oracle database. The difference in performance isn't huge, but enough to make me think it would be best to use the JOIN approach for the last step when the result sets get big. I've used the IN form since I constructed it following the steps I've put on paper to get the right result, so I'd expect that query to deliver what I need. But now I'm not entirely certain. – G_H Aug 24 '11 at 11:54
  • 1
    Could you post the query you are really running? – Jacob Aug 24 '11 at 11:58
  • Is `item` in `table2` nullable? – ypercubeᵀᴹ Aug 24 '11 at 12:03
  • 1
    You can also rewrite the query with `EXISTS`. – ypercubeᵀᴹ Aug 24 '11 at 12:04
  • These queries will result the same but Performance depends on RDBMS . – Gourav khanna Aug 24 '11 at 12:06
  • Ah, forgot about exists. Maybe a good way of cross-checking. Also, in my situation the item column would not be nullable, it's a primary key. – G_H Aug 24 '11 at 12:07
  • Good lord, EXISTS is showing me yet another result set. I'll see if I can post the original queries, this is driving me nuts. – G_H Aug 24 '11 at 12:10
  • 1
    I think a possible explanation for not showing identical results would be if `table1.itemcode` and `table2.item` are of CHAR type and there are empty strings in the tables. Is that the case? – ypercubeᵀᴹ Aug 24 '11 at 12:18
  • 1
    That's a good point! The tables do use CHAR instead of VARCHAR and there's some white space that we also need to trim when processing the results. Best lead so for, I'll go check for some particular rows. – G_H Aug 24 '11 at 12:23
  • @G_H: is any of the fields used with `IN` nullable or has empty strings as values? – ypercubeᵀᴹ Aug 24 '11 at 12:30
  • 1
    I don't think `CHAR` or `VARCHAR` makes any difference and haven't worked much with Oracle but from what I know `''` is equivalent to `NULL` and that changes the behaviour of `IN` vs. `JOIN` and `EXISTS`. – ypercubeᵀᴹ Aug 24 '11 at 12:32
  • @G_H let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/2804/discussion-between-ypercube-and-g-h) – ypercubeᵀᴹ Aug 24 '11 at 12:53
  • I'm at a customer right now, so I don't have too much time. But I'll try all suggestions here and do some more specific searching, then come back with any results. – G_H Aug 24 '11 at 13:42

5 Answers5

4

Since table2.item is not nullable, the 2 versions are equivalent. You can remove the distinct from the IN version, it's not needed. You can check these 3 versions and their execution plans:

SELECT id, itemcode FROM table1 WHERE itemcode IN
  ( SELECT item FROM table2 WHERE name [some arbitrary test] )

SELECT id, itemcode FROM table1 JOIN
  ( SELECT DISTINCT item FROM table2 WHERE name [some arbitrary test] )
  items ON table1.itemcode = items.item

SELECT id, itemcode FROM table1 WHERE EXISTS
  ( SELECT * FROM table2 WHERE table1.itemcode = table2.item 
                           AND (name [some arbitrary test]) )
ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
  • 2
    +1 - `DISTINCT` in `IN` or `EXISTS` subqueries make me bonkers. Thanks for pointing that out. – JNK Aug 24 '11 at 12:54
  • THey are equivalent only as long as you don't want to actually return any fields from the subquery in the result set, then you need a join. – HLGEM Aug 24 '11 at 13:18
1

Ideally I would want to see the differences between the result sets.
- Are you getting duplication of records
- Is one set always a sub-set of the other
- Does one set have both 'additional' and 'missing' records in comparison to the other?

That said, the logic should be equivilent. My best guess would be that you have some empty string entries in there; because Oracle's version of a NULL CHAR/VARCHAR is just an empty string. This can give very funky results if you're not prepared for it.

MatBailie
  • 83,401
  • 18
  • 103
  • 137
1

Both queries perform a semijoin i.e. no attributes from table2 appear in the topmost SELECT (the resultset).

To my eye, your first query is easiest to identify as a semijoin, EXISTS even more so. On the other hand, an optimizer would no doubt see it differently ;)

onedaywhen
  • 55,269
  • 12
  • 100
  • 138
1

You can also try to do a direct join to the second table

SELECT DISTINCT id, itemcode 
FROM table1 
INNER JOIN table2 ON table1.itemcode = table2.item   
WHERE name [some arbitrary test] )

You don't need distinct if item is primary key or unique

Exists and Inner Join should have the same execution speed, while IN is more expensive.

Bulat
  • 6,869
  • 1
  • 29
  • 52
1

I'd look for some data type conversion in there.

create table t_vc (val varchar2(6));
create table t_c (val char(6));

insert into t_vc values ('12345');
insert into t_vc values ('12345 ');

insert into t_c values ('12345');
insert into t_c values ('12345');

select t_c.val||':'
from t_c
where val in (select distinct val from t_vc);

select c.val||':'
from t_vc v join (select distinct val from t_c) c on v.val=c.val;
Gary Myers
  • 34,963
  • 3
  • 49
  • 74