Find duplicates in a one to many relationship in SQL

Question

Problem

I have 2 tables:

Table tTag
idTag int
otherColumns

And

Table tTagWord
idTagWord int
idTag int
idWord int
position int

For example:

enter image description here

So each idTag will have multiple idTagWord (unknown number), the position is important too. I try to find the best way, for the performance, to find the duplicates.

A duplicate would be to have the same idWords in the same order (position) for 2 different idTag.

What I have tried

SELECT GROUP_CONCAT(DISTINCT tab.idTag SEPARATOR ',') INTO @idTagSet
FROM (  SELECT idTag,GROUP_CONCAT(idWord order by position ASC SEPARATOR ' ') AS Tag
        FROM tTagWord
        GROUP BY idTag) AS tab
INNER JOIN (SELECT idTag,GROUP_CONCAT(idWord order by position ASC SEPARATOR ' ') AS Tag
            FROM tTagWord
            GROUP BY idTag) AS tab2 ON tab.Tag = tab2.Tag
WHERE tab.idTag <> tab2.idTag;

The previous query returns a set of the duplicate idTags, so it works. But the performance is terrible. With 150 000 idTag, it already takes several minutes and the table will soon have millions of idTag.

I also tried something like this answer

select idTag, GROUP_CONCAT(idWord order by position ASC SEPARATOR '-') AS idWordSet
from tTagWord
group by idTag
Having COUNT(idWordSet) > 1;

But I can't seem to find a way. Any idea?

Why/how is "the position important"? And yes, the usage of GROUP_CONCAT is not appropriate here (or most anywhere except perhaps final output, but that's a digression) - indexes go by-by and extra computations increase significantly. — user2864740, Sep 16 '14 at 20:52
Think about the idTag as a group of words (idWord). "The bird" is different from "bird The". — dyesdyes, Sep 16 '14 at 20:53
So it is only a duplicate if the idWord *and* position are the same? (Make show *expected* output; and include possible counter-examples not-selected.) — user2864740, Sep 16 '14 at 20:54

score 2 · Accepted Answer · answered Sep 16 '14 at 21:04

2

How about trying two group bys?

SELECT words, count(*), group_concat(idtag) as tags
FROM (SELECT idTag, GROUP_CONCAT(idWord order by position ASC SEPARATOR ' ') AS words
      FROM tTagWord
      GROUP BY idTag
     ) t
GROUP BY words
HAVING count(*) > 1;

answered Sep 16 '14 at 21:04

Gordon Linoff

1,242,037
58
646
786

Seems obvious now that I see it. Thanks! – dyesdyes Sep 16 '14 at 21:09

score 2 · Answer 2 · answered Sep 16 '14 at 21:18

This type of query is sometimes known as relational division, there's a whole bunch of methods at https://www.simple-talk.com/sql/t-sql-programming/divided-we-stand-the-sql-of-relational-division/

One example is:

select
    t1.idTag as tag1,
    t2.IdTag as tag2
from
    tTagWord t1
        inner join
    tTagWord t2
        on t1.idWord = t2.idWord and
           t1.position = t2.position and
           t1.idTag < t2.idTag
group by
    t1.idTag,
    t2.idTag
having
    count(*) = (
        select
            count(*)
        from
            tTagWord t3
        where
            t3.idTag = t1.idTag
    ) and
    count(*) = (
        select
            count(*)
        from
            tTagWord t4
        where
            t4.idTag = t2.idTag
    );

Here's an example. I've put Gordon's query there too. They might have different performance characteristics.

Find duplicates in a one to many relationship in SQL

2 Answers2