1

I'm trying to find and remove MySQL row duplicates that are right after each other, instead of finding all, even if they're not straight after each other.

SELECT DISTINCT(content) AS contentMsg, COUNT(*) AS cnt, `ticketId`,`date` 
FROM ticketsReplies 
WHERE username = 'X' 
GROUP BY contentMsg, ticketId 
HAVING cnt > 1 
ORDER BY cnt DESC

This is my current code. However, this finds duplicates if there's just two of the same answers in one ticket instead of them having to be IDs right after each other (which can happen if you send a POST request, and it fails, and you refresh etc).

How would I go about finding ones that are only 1 ID from each other.

So finding e.g. 1,2,3,4,5,6,7 instead of 1,3,9,11

E.g. if you have

ID                     EMAIL                
---------------------- -------------------- 
1                      aaa                  
2                      bbb                  
3                      bbb                  
4                      bbb                  
5                      ddd                  
6                      eee                  
7                      aaa                  
8                      aaa                  
9                      bbb

If you have this, it should find the following IDs: 2,3,4 but not 9 as it's not directly after 4 even though its a duplicate.

It should also find 7,8 but not 1 as they are not right after each other.

The Doe
  • 11
  • 2
  • Possible duplicate of [MySQL delete duplicate records but keep latest](https://stackoverflow.com/questions/6107167/mysql-delete-duplicate-records-but-keep-latest) – Masivuye Cokile Jul 09 '18 at 13:13
  • 1
    Note that DISTINCT is not a function, and it's unlikely ever to appear in an aggregated query (except within an aggregated function). – Strawberry Jul 09 '18 at 13:19
  • 1
    See: [Why should I provide an MCVE for what seems to me to be a very simple SQL query?](https://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query) – Strawberry Jul 09 '18 at 13:19
  • And on top of that your GROUP BY is not ANSI valid meaning it will fail on MySQL server what have enabled``ONLY_FULL_GROUP_BY` in the sql_mode.. It will run on MySQL servers that don't have that sql_mode but will generated untrustable data for every column name in the SELECT which is not named in the GROUP BY in this case this is the column date.. – Raymond Nijland Jul 09 '18 at 13:23
  • @Strawberry I've updated my OP for better understanding. – The Doe Jul 09 '18 at 13:23
  • What MySQL version do you use? Use `SELECT VERSION()` query to find out. – Raymond Nijland Jul 09 '18 at 13:26
  • How many records are we talking here? It might be simpler to dump the entire table to a file or memory and do the analysis in code instead of in a query. – Patrick Q Jul 09 '18 at 13:28
  • @PatrickQ maybe - but that sounds fantastically unlikely to me. – Strawberry Jul 09 '18 at 13:36
  • @Strawberry Yes, unlikely. Just throwing it out as an option if we're talking like 100 records or something. Some simple code might be a better option than OP running a query they don't understand. – Patrick Q Jul 09 '18 at 13:38

2 Answers2

1

E.g.:

SELECT id 
  FROM 
     ( SELECT x.id  FROM my_table x JOIN my_table y ON y.email = x.email AND y.id = x.id + 1 ) a 
 UNION 
     ( SELECT y.id FROM my_table x JOIN my_table y ON y.email = x.email AND y.id = x.id + 1 );
Strawberry
  • 33,750
  • 13
  • 40
  • 57
  • I know OP doesn't mention it, but what happens if he has gaps in the ID column? Let's say `id=5 email=bbb` and the next one is `id=7 email=bbb` because he deleted `id=6`? – Michel Jul 09 '18 at 13:46
  • @Michel The OP would have to state how they wanted to deal with that, but either way, the fix would be fairly straightforward. – Strawberry Jul 09 '18 at 13:50
0

If there are gaps in your id list (eg 5, 6, 9, 11), simply comparing id = id+1 wouldn't work. The solution I came up with is to create two identical temporary tables with sequential row-numbers. In that case you can safely compare the rows based on their number, even if the id's have gaps.

DELETE FROM tab WHERE id IN (

SELECT A.id
FROM 
    (
    SELECT row_nr, id, email FROM (
        SELECT
            (@cnt1 := @cnt1 + 1) AS row_nr,
            t.id,t.email
        FROM tab AS t
          CROSS JOIN (SELECT @cnt1 := 0) AS d
        ORDER BY t.id
        ) x
    ) A
INNER JOIN 
    (
    SELECT row_nr, id, email FROM (
        SELECT
            (@cnt2 := @cnt2 + 1) AS row_nr,
            t.id,t.email
        FROM tab AS t
          CROSS JOIN (SELECT @cnt2 := 0) AS d
        ORDER BY t.id
        ) x
    ) B
ON A.row_nr-1 = B.row_nr AND A.email=B.email

)

The two (SELECT row_nr, id, email FROM ... ) x parts create two identical tables A and B like

row_nr   id  email
 1       1   aaa
 2       4   aaa
 3       5   bbb
 4       9   aaa
 5       11  aaa

Then you can compare the sequential row-nr's and email:

ON A.row_nr-1 = B.row_nr AND A.email=B.email

Selecting the result-id's gives you the id's 4, 11 which are the duplicates. Then you can delete those id's:

DELETE FROM tab WHERE id IN ( ... )

Here is a Fiddle to test the SELECT part.

NOTE: Before you try this at home, please backup your table!

Michel
  • 4,076
  • 4
  • 34
  • 52