Find duplicates takes a long itme

Question

I'm having the following table layout: four different tables, each containing around 10 to 15 million entries. Three string attributes of each table are the same (let's call them Id, Name1, Name2). Now we want to read all entries having the same Id column but different (Name1,Name2) tuples. It is estimated that less than 0.5 % of all entries are matching.

We've created a view AllEntries (basically a UNION ALL of the relevant attributes over all four tables) and our query looks like this:

SELECT *
FROM AllEntries
GROUP BY Id
HAVING COUNT(DISTINCT(Name1)) > 1 OR COUNT(DISTINCT(Name2)) > 1

Executing the query in our test database with 2 million entries in each table (i.e. 8 million entries in the view) already takes around 2 to 3 minutes (nice server).

Q: Is there any performance improvement possible to improve the performance?

Might be more efficient to have a distinct query on your source tables before doing the union on them. — 500 - Internal Server Error, Mar 20 '14 at 14:53
The best solution might depend on whether you'll do this regularly, how often the data changes, whether you can make schema changes, etc. First thing that springs to my mind, for example, is to create a new (indexed) column which is a numeric hash of the concatenation of name1 and name2, then match on that as the first step. (Don't worry about hash collisions; you can still match on the actual data as a second step, but you'll have a much smaller dataset then.) — Matt Gibson, Mar 20 '14 at 14:53
@KarlKieninger Pretty sure you cannot create an index on union view http://stackoverflow.com/questions/751838/create-an-index-on-sql-view-with-union-operators-will-it-really-improve-perform — paparazzo, Mar 20 '14 at 16:55

score 2 · Answer 1 · answered Mar 20 '14 at 15:05

Try a CTE with ROW_NUMBER() instead of the traditional GROUP BY/HAVING approach:

;with cteDups as
(
    Select  *
            ,ROW_NUMBER() Over(Partition By Name1 Order By Id) rn1
            ,ROW_NUMBER() Over(Partition By Name2 Order By Id) rn2
    From    AllEntries
)
Select  *
From    cteDups
Where   rn1 > 1
    Or  rn2 > 1

Gordon Linoff · Answer 2 · 2014-03-20T17:44:10.967

count(distinct) is more resource intensive than other aggregation functions. You could try:

SELECT *
FROM AllEntries
GROUP BY Id
HAVING min(Name1) <> max(Name1) or min(Name2) <> max(Name2);

If you build indexes on id, Name1 and id, Name2 in each of the subtables, the following should show a significant performance improvement:

select ae.*
from AllEntries ae
where exists (select 1 from subtable1 ae2 where ae.id = ae2.id and ae.Name1 <> ae2.Name1) or
      exists (select 1 from subtable2 ae2 where ae.id = ae2.id and ae.Name1 <> ae2.Name1) or
      . . .

These are split into subqueries to encourage the optimizer to use a different index on each one.

Look at the second. Pretty sure that is not the syntax you mean. — paparazzo, Mar 20 '14 at 17:28

score 0 · Answer 3 · answered Mar 20 '14 at 14:58

This will largely depend on your indices, but the last statement with the OR condition is certainly not ideal for a table of this size. Also, I'm not sure why you need a SELECT * for this... extra IO. Avoid it if you can.

Try something like this....

SELECT id, COUNT(name1)
FROM {table}
GROUP BY id
HAVING COUNT(*) > 1

UNION ALL


SELECT id, COUNT(name2)
FROM {table}
GROUP BY id
HAVING COUNT(*) > 1

UNION ALL

etc

This will let you leverage the index on ID and also avoid COUNT DISTINCT, which is a very expensive function as a rule.

If you want to get those specific records, I'd recommend writing a CTE and joining the result of the query above with the data itself.... return all the ids and the names for which count(*) is higher.

For as few records as you expect to be duplicated in some way, another option is just

SELECT id, COUNT(*)
FROM {table}
GROUP BY id
HAVING COUNT(*) > 1

and then join that with your datatable using id.... that'd avoid the extra work with the union all stuff, and would show you every record for which ID is duplicated. To me that would be preferable anyway... you probably don't want duplicate ID's :-)

OP is not looking for dups across all names. Even if that was the case eliminating count(*) > 1 would break on same value once in more than 1 table. — paparazzo, Mar 20 '14 at 16:51

paparazzo · Answer 4 · 2014-03-20T17:33:47.353

That UNION is killing it
There is no use of indexes on name1 or name2 after a union
Cannot create an index on a view with UNION

You are going to think this is whacked out but give it a try
It uses indexes and favors a small number of matches
If you don't have indexes on word1 and word2 then do so

select distinct ta.ID 
  from t1 as ta
  join t1 as tb
    on ta.ID = tb.ID 
   and ( (ta.word1 <> tb.word1) or (ta.word2 <> tb.word2) )
union 
select distinct ta.ID 
  from t1 as ta
  join t2 as tb
    on ta.ID = tb.ID 
   and ( (ta.word1 <> tb.word1) or (ta.word2 <> tb.word2) )
union 
  t1 t3
union 
  t1 t4
union 
  t2 t2
union 
  t2 t3
union 
  t2 t4
union 
  t3 t3
union 
  t3 t4
union
  t4 t4

Find duplicates takes a long itme

4 Answers4