2

I have two tables with the same column name: CookieID (bigint)

  1. table (Cookies) has 5 million records
  2. table (CookiePurpose) has 10 million records

I want to create a new table, with only the CookieIDs from #1 that do not exist in #2.

The following approach took over 20 minutes.

CREATE TABLE #KnownCookieIDs

(
    [CookieID] [bigint] NOT NULL
)

INSERT INTO 
#KnownCookieIDs
SELECT CookieID FROM CookiePurpose
EXCEPT
SELECT CookieID from Cookies

Is there a faster way to achieve this?

Update

I was testing on an Azure database with a S0 tier plan, which has 10 DTUs. Figured the long query time was because of this, so I am in the process of exporting the DB to my local machine, and will try out your suggestions from there.

Kenci
  • 4,794
  • 15
  • 64
  • 108

2 Answers2

2

This query should work faster:

CREATE TABLE #KnownCookieIDs

(
    [CookieID] [bigint] NOT NULL
)

INSERT INTO 
    #KnownCookieIDs
SELECT 
    c.CookieID 
FROM 
    Cookie c
    LEFT JOIN CookiePurpose cp ON c.CookieID = cp.CookieID
WHERE
    cp.CookieID IS NULL
progm
  • 2,782
  • 3
  • 14
  • 32
  • 1
    I think your tables are backwards – UnhandledExcepSean Jun 26 '18 at 13:33
  • Except for using the wrong table as @UnhandledExcepSean said, this should be faster, especially if you have an index on CookieID on both tables. OTOH, you're already getting 500,000 records/minute. It's possible that you have reached a system performance limit (CPU/dIsk, etc.). See: https://stackoverflow.com/questions/2686254/how-to-select-all-records-from-one-table-that-do-not-exist-in-another-table – Terry Carmen Jun 26 '18 at 13:42
  • 1
    I said nothing in regards to the performance. The issue I see in your SQL is that, they want all Cookie records missing a corresponding CookiePurpose record, yours does the opposite. – UnhandledExcepSean Jun 26 '18 at 13:46
  • LEFT JOIN is apparently still much faster than EXCEPT: https://www.sqlshack.com/t-sql-commands-performance-comparison-not-vs-not-exists-vs-left-join-vs-except/ – Terry Carmen Jun 26 '18 at 13:48
  • @UnhandledExcepSean Thank you. The query of the owner of the question specifies a different sequence than in the text. I changed the tables as described in the description. – progm Jun 26 '18 at 14:05
0

So, I decided to do tad of experimentation. The result on a data set 1/10th the size of the OP's showed a sub-select performed better than a left join. Although, I would expect the left join to work the best in larger data sets. UPDATE: I set up a dataset like the OP and the LEFT JOIN still didn't out perform the sub-select, but it was very close.

Ensuring there are indexes on the two tables is critical for speed. So make sure your tables are indexed appropriately.

/*
DROP TABLE Cookie
CREATE TABLE Cookie ([CookieID] [bigint] NOT NULL IDENTITY(1,1) PRIMARY KEY,Dummy bit)

DROP TABLE CookiePurpose
CREATE TABLE CookiePurpose (CookiePurposeID [bigint] NOT NULL IDENTITY(1,1) PRIMARY KEY, [CookieID] [bigint] NOT NULL)
CREATE NONCLUSTERED INDEX IX_CookiePurpose_CookieID ON dbo.CookiePurpose (CookieID)

SET NOCOUNT ON

DECLARE @CookieCounter BIGINT=(SELECT COUNT(CookieID) FROM Cookie)
WHILE @CookieCounter<500000 BEGIN INSERT INTO Cookie(Dummy) SELECT 1; SET @CookieCounter=@CookieCounter+1; END

DECLARE @CookiePurposeCounter BIGINT=(SELECT COUNT(CookieID) FROM CookiePurpose)
WHILE @CookiePurposeCounter<1000000 BEGIN INSERT INTO CookiePurpose(CookieID) SELECT Round(( ( 1000000000 - 1 - 1 ) * Rand() + 1 ), 0); SET @CookiePurposeCounter=@CookiePurposeCounter+1; END

*/

--EXCEPT: 300ms @ 1/10 data; 2917ms for full data
DECLARE @start DATETIME2,@end DATETIME2; SET @start=GETDATE()
DECLARE  @KnownCookieIDs TABLE ([CookieID] [bigint] NOT NULL)
INSERT INTO @KnownCookieIDs
    SELECT CookieID FROM CookiePurpose
    EXCEPT
    SELECT CookieID from Cookie
SET @end=GETDATE(); SELECT DATEDIFF(ms,@start,@end) AS [ms elapsed];
GO

--LEFT JOIN: 123ms @ 1/10 data; 1093ms for full data
DECLARE @start DATETIME2,@end DATETIME2; SET @start=GETDATE()
DECLARE  @KnownCookieIDs TABLE ([CookieID] [bigint] NOT NULL)
INSERT INTO @KnownCookieIDs
    SELECT c.CookieID
    FROM Cookie c
    LEFT JOIN  CookiePurpose cp ON cp.CookieID=c.CookieID
    WHERE cp.CookieID IS NULL
SET @end=GETDATE(); SELECT DATEDIFF(ms,@start,@end) AS [ms elapsed];
GO

--sub-select: 113 ms @ 1/10 data; 1046ms for full data
DECLARE @start DATETIME2,@end DATETIME2; SET @start=GETDATE()
DECLARE  @KnownCookieIDs TABLE ([CookieID] [bigint] NOT NULL)
INSERT INTO @KnownCookieIDs
    SELECT c.CookieID
    FROM Cookie c
    WHERe c.CookieID NOT IN (
        SELECT CookieID FROM CookiePurpose
    )
SET @end=GETDATE(); SELECT DATEDIFF(ms,@start,@end) AS [ms elapsed];
GO


--Delete: 767ms @ 1/10 data; 8450ms for full data
DECLARE @start DATETIME2,@end DATETIME2; SET @start=GETDATE()
DECLARE  @KnownCookieIDs TABLE ([CookieID] [bigint] NOT NULL)
INSERT INTO @KnownCookieIDs
    SELECT c.CookieID
    FROM Cookie c

DELETE c
FROM @KnownCookieIDs c
INNER JOIN CookiePurpose cp ON cp.CookieID=c.CookieID

SET @end=GETDATE(); SELECT DATEDIFF(ms,@start,@end) AS [ms elapsed];
GO
UnhandledExcepSean
  • 12,504
  • 2
  • 35
  • 51