0

I'm trying to find a better way of finding duplicates in SQL Server. This took over 20 minutes to run with just over 300 million records before results started showing in the results window within SSMS. Another 22 minutes elapsed before it crashed.

Then SSMS threw this error after displaying 16,777,216 records:

An error occurred while executing batch. Error message is: Exception of type 'System.OutOfMemoryException' was thrown.

Schema:

ENCOUNTER_NUM - numeric(22,0)
CONCEPT_CD - varchar(50)
PROVIDER_ID - varchar(50)
START_DATE - datetime
MODIFIER_CD - varchar(100)
INSTANCE_NUM - numeric(18,0)


SELECT
    ROW_NUMBER() OVER (ORDER BY f1.[ENCOUNTER_NUM],f1.[CONCEPT_CD],f1.[PROVIDER_ID],f1.[START_DATE],f1.[MODIFIER_CD],f1.[INSTANCE_NUM]),
    f1.[ENCOUNTER_NUM], 
    f1.[CONCEPT_CD], 
    f1.[PROVIDER_ID], 
    f1.[START_DATE], 
    f1.[MODIFIER_CD], 
    f1.[INSTANCE_NUM]
FROM
    [dbo].[I2B2_OBSERVATION_FACT] f1
    INNER JOIN [dbo].[I2B2_OBSERVATION_FACT] f2 ON
        f1.[ENCOUNTER_NUM] = f2.[ENCOUNTER_NUM] 
        AND f1.[CONCEPT_CD] = f2.[CONCEPT_CD]
        AND f1.[PROVIDER_ID] = f2.[PROVIDER_ID]
        AND f1.[START_DATE] = f2.[START_DATE]
        AND f1.[MODIFIER_CD] = f2.[MODIFIER_CD]
        AND f1.[INSTANCE_NUM] = f2.[INSTANCE_NUM]
JustBeingHelpful
  • 18,332
  • 38
  • 160
  • 245

1 Answers1

8

Not sure how much faster this is, but worth a try.

SELECT
    COUNT(*) AS Dupes,
    f1.[ENCOUNTER_NUM], 
    f1.[CONCEPT_CD], 
    f1.[PROVIDER_ID], 
    f1.[START_DATE], 
    f1.[MODIFIER_CD], 
    f1.[INSTANCE_NUM]
FROM
    [dbo].[I2B2_OBSERVATION_FACT] f1
GROUP BY
    f1.[ENCOUNTER_NUM], 
    f1.[CONCEPT_CD], 
    f1.[PROVIDER_ID], 
    f1.[START_DATE], 
    f1.[MODIFIER_CD], 
    f1.[INSTANCE_NUM]
HAVING
    COUNT(*) > 1
EkoostikMartin
  • 6,831
  • 2
  • 33
  • 62
  • with a suitable index it should be pretty fast – Phil May 02 '13 at 17:58
  • 2
    +1, but I would use `COUNT(*)` instead of `COUNT(1)` ... it's more explicit to what you're doing... counting rows. – Matthew May 02 '13 at 18:01
  • +1 A 300M x 300M join is no joke unless you run it as a merge join, which requires a very specific index. This should be much faster! – Andomar May 02 '13 at 18:02
  • I should clarify. Before SSMS could start displaying results, it took 20 minutes. It's still trying to print everything in the results window. – JustBeingHelpful May 02 '13 at 18:07