What's the best way to get intersected data from one table?

Question

Suppose I have the below table

enter image description here

CREATE TABLE [dbo].[TestData](
    [ID] [bigint] NOT NULL,
    [InstanceID] [int] NOT NULL,
    [Field] [int] NULL,
    [UserID] [bigint] NOT NULL
) ON [PRIMARY]

GO
INSERT [dbo].[TestData] ([ID], [InstanceID], [Field], [UserID]) 
VALUES (1, 1, NULL, 1000),(2, 1, NULL, 1002),(3, 1, NULL, 1000),
    (4, 1, NULL, 1003),(5, 2, NULL, 1002), (6, 2, NULL, 1005),
    (7, 2, NULL, 1006),(8, 2, NULL, 1007),(9, 3, NULL, 1002),
    (10, 3, NULL, 1006),(11, 3, NULL, 1009),(12, 3, NULL, 1010),
    (13, 1, NULL, 1006),(14, 2, NULL, 1002),(15, 3, NULL, 1003)
GO

I search for the best practice to write a query to get the full rows of intersected data between two instances using UserID

For example the intersected UserIDs between InstanceID 1 and 2 are ( 1002 , 1006 ), to get the results I wote the query in two different ways as below :

Select * From TestData
Where UserID in 
( 
    Select T1.UserID From TestData T1 Where InstanceID = 1
        Intersect
    Select T2.UserID From TestData T2 Where InstanceID = 2
)
and InstanceID in (1,2) Order By 1

Second

Select * From TestData
Where UserID in 
( 
    Select Distinct T1.UserID 
    From TestData T1 join TestData T2 on T1.UserID = T2.UserID
    Where T1.InstanceID = 1 and T2.InstanceID = 2
)
and InstanceID in (1,2) Order By 1

So the results will be

enter image description here

Is one of the above queries is the best way to get the results ??

score 0 · Answer 1 · edited May 23 '17 at 11:57

0

Using EXISTS is better than using IN. When using the IN subquery, the entire resultset is processed. With EXISTS, it just searches as they are found to match. As far as your question, I think the INTERSECT implementation just simply does the join anyways so there shouldn't be a difference.

EDIT: a post Here says that for IN vs EXISTS, the optimizer will treat them the same as well (as of 2008). So pretty much my guess as well as what I just read boils down to :They will perform the same because the optimizer knows.

edited May 23 '17 at 11:57

Community

1
1

answered Apr 22 '13 at 23:28

Scotch

3,186
11
35
50

During the simplication phase the query optimizer does not always treat IN and EXISTS in the same way. Even while testing the above code, I received two different plans when testing between EXISTS and IN. However, when I created a PRIMARY KEY on the ID field, then I got identical results from the query optimizer with no additional changes to the code. EXISTS is definately the safer of the two. – Registered User Apr 22 '13 at 23:50
I guess I just assumed a primary key on the ID field. I was leaning towards `EXISTS`, but I figured that them there quuuury optimizationers were smart enough to know what you mean. – Scotch Apr 23 '13 at 02:33

score 0 · Answer 2 · answered Apr 22 '13 at 23:44

Here's an example of the query if you were to use EXISTS statements:

SELECT * 
FROM TestData td
WHERE td.InstanceID IN (1, 2)
AND EXISTS
    (SELECT 1
    FROM TestData sub
    WHERE td.UserID = sub.UserID
    AND sub.InstanceID = 2)
AND EXISTS
    (SELECT 1
    FROM TestData sub
    WHERE td.UserID = sub.UserID
    AND sub.InstanceID = 1)
ORDER BY 1;

For the sample data provided, there was no noticable performance difference between any of the three solutions. However, I agree with Scotch that using EXISTS statements will help performance over IN statements under specific scenarios.

The best thing you can do to improve performance is create the table with a PRIMARY KEY. Setting the ID field as a PRIMARY KEY will bolster performance by 50% since the highest cost of your query is sorting the data.

score 0 · Answer 3 · answered Apr 23 '13 at 01:51

0

You can also do this with an aggregation and join:

select td.*
from TestData td join
     (select td.userid
      from TestData
      group by td.userId
      having sum(case when InstanceId = 1 then 1 else 0 end) > 0 and
             sum(case when InstanceId = 2 then 1 else 0 end) > 0
    ) td2
    on td.userid = td2.userid

The advantage to the aggregation is that the having clause makes it very flexible in terms of the conditions you can represent. Performance will be best if you have an index on userId, InstanceId.

answered Apr 23 '13 at 01:51

Gordon Linoff

1,242,037
58
646
786

are you sure that using aggregation will be best in performance ? as i think that using 'intersect' will be better .. could you explain please ? – Amr Badawy Apr 23 '13 at 05:40
@AmrBadawy . . . No, I am not sure which gives the best performance. For that, you would need to test the different solutions. I prefer the aggregation solution because it solves "set-in-set" problems in general (and this is just one example of that type of problem). – Gordon Linoff Apr 23 '13 at 13:29

Aleksandr Fedorenko · Answer 4 · 2013-04-24T07:06:07.587

The script is used by two operations of Index seek and one operation of Distinct sorting.

SELECT ID, InstanceID, Field, UserID
FROM [dbo].[TestData] t
WHERE InstanceID IN(1, 2) 
  AND EXISTS (
              SELECT 1
              FROM [dbo].[TestData] t2
              WHERE InstanceID IN(1, 2) AND t.UserID = t2.UserID
              HAVING COUNT(DISTINCT t2.InstanceID) = 2
              )
ORDER BY t.ID

OR

;WITH cte AS
 (
  SELECT ID, InstanceID, Field, UserId
         ,COUNT(*) OVER(PARTITION BY InstanceID, UserID) AS cntInstanceUser
  FROM [dbo].[TestData] t
  WHERE InstanceID IN(1, 2)
  )
  SELECT c.ID, c.InstanceID, c.Field, c.UserID
  FROM cte c
  WHERE EXISTS (
                SELECT 1
                FROM cte c2 
                WHERE c2.UserId = c.UserID
                HAVING COUNT(*) != c.cntInstanceUser
                )
  ORDER BY c.ID

For improving performance use this index:

CREATE INDEX x ON [dbo].[TestData](InstanceID, UserID) INCLUDE(Id, Field)

Demo on SQLFiddle

What's the best way to get intersected data from one table?

4 Answers4