1

A bug has arisen in my code which has allowed duplicate entries to appear in a database table, but with different casing. I am using MySQL.

e.g In this example, the second row shouldn't have been added, but it was.

    ID    Name          Description
-------------------------------------
    1    HELLO WORLD    Saying hello
    2    Hello world    Saying hello

To remedy the incorrect data, I want to delete the row where the full name is capitalised whereever a duplicate has occured. Can I do regex in SQL, is that the best solution, and how would I do it?

The table, in full, is as follows

'ADDRESSBOOK_GROUP', 'CREATE TABLE ADDRESSBOOK_GROUP ( ID int(10) unsigned NOT NULL default ''0'', NAME varchar(255) collate utf8_unicode_ci NOT NULL, DESCRIPTION text collate utf8_unicode_ci, ADDRESSBOOK_ID int(10) default NULL, TYPE int(10) unsigned NOT NULL default ''1'', PRIMARY KEY (ID), KEY ADDRESSBOOK_ID (ADDRESSBOOK_ID) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ROW_FORMAT=DYNAMIC'

Edit - Also I forget to mention earlier that I only want the duplicate checks to occur when ADDRESSBOOK_ID is '6';

Andy A
  • 4,191
  • 7
  • 38
  • 56
  • Which RDBMS? And show the table, too. – fancyPants Dec 12 '12 at 11:43
  • I've provided a simplified example of the table. – Andy A Dec 12 '12 at 11:56
  • So far I've looked at working out the SQL, but I'm unsure on the WHERE statement. I've no regex experience and have struggled to get a foothold in it. I presume its a fairly simple piece of regex once I know what I'm doing though (unless I don't need regex at all?). – Andy A Dec 12 '12 at 11:58

1 Answers1

5

the following selects all the upper case values where doubles exists

delete t.* from table_name t
inner join (
    select distinct t.id
    from table_name t
    left join table_name t2
        on t2.col_name = t.col_name
    where t.col_name LIKE BINARY UPPER(t.col_name)
) as a
    on a.id = t.id;

edit:

delete t.* from ADDRESSBOOK_GROUP t
inner join (
    select distinct t.ID
    from ADDRESSBOOK_GROUP t
    left join (
        select
            count(*) as n,
            NAME,
            ADDRESSBOOK_ID
        from ADDRESSBOOK_GROUP
        group by 
            NAME,
            ADDRESSBOOK_ID
    ) as t2
        on t2.name = t.name
        and t2.ADDRESSBOOK_ID = t.ADDRESSBOOK_ID
    left join ADDRESSBOOK_GROUP t_n
        on t_n.ID = t.ID
        and t_n.NAME REGEXP ('^[0-9]+$') 
    where 
        t.NAME like binary upper(t.NAME)
        and t_n.ID is null
        and t2.n > 1
) as a
    on a.ID = t.ID
where t.ADDRESSBOOK_ID = 6;
Julien May
  • 2,011
  • 15
  • 18
  • Thanks Julien, but I also need to check if the row already exists with a different casing. – Andy A Dec 12 '12 at 12:00
  • Also `LIKE` is not case sensitive! – fancyPants Dec 12 '12 at 12:03
  • @JulienMay, does this actually work for you? From what I can tell, MySQL doesn't allow a subselect on the same table you are deleting from. See this question for example: http://stackoverflow.com/questions/4471277/mysql-delete-from-with-subquery-as-condition – dan1111 Dec 12 '12 at 12:28
  • @dan1111: yes this works for me. the relevant part to get it working is to use a join as shown – Julien May Dec 12 '12 at 12:32
  • @JulienMay +1. I was just wondering because I tried to do something similar. But I did not use a join, so that must have been my problem. – dan1111 Dec 12 '12 at 12:34
  • Thanks Julien! This is pretty good so far, but it only appears to delete 1 offending row each time I run the query. I have many duplicates in my database I would like to delete with one piece of SQL. – Andy A Dec 12 '12 at 12:42
  • @AndyA finally :) -> fixed – Julien May Dec 12 '12 at 12:46
  • That looks great now! Just one more point that I forgot to mention before (My fault!). The check for duplicates only wants to apply to duplicates where tablename.ADDRESSBOOK_ID = 6. – Andy A Dec 12 '12 at 12:52
  • Mmm, Im confused. Your query is bringing up a row called '1'. However, even if I ensure there is no other row called '1' within the system, your query still wants to delete it. – Andy A Dec 12 '12 at 12:55
  • It appears that names which include numbers are being incorrectly deleted in your query. – Andy A Dec 12 '12 at 13:52
  • @AndyA: can you edit your question and add additional info what exactly should and what should not be removed. please clarify what to with numbers – Julien May Dec 12 '12 at 14:07
  • Hi Julien. The question is correct now. The only bit I forgot, is to say that duplicates are only not allowed when the addressbook_id of the table row is 6 (In other words, Im only deleting duplicate columns for the user with an id of 6 - so this needs to be checked in the comparison). But also note that, currently, your query appears to incorrectly delete rows with a numerical name. – Andy A Dec 12 '12 at 14:49
  • @AndyA: not sure if i understood your numeric problem... however see edit for your self. – Julien May Dec 12 '12 at 15:58
  • Thanks again Julien, but your most recent query does not work. It only deletes numeric group names. To clarify the numeric problem with your original query ... imagine I have the following 4 group names in my table ... "HELLO, hello, 1, Goodbye". I would expect only 'HELLO' to be deleted with the query. However, both 'HELLO' and '1' are being deleted. It appears to me that group names which contain numbers are being wrongly deleted. – Andy A Dec 12 '12 at 16:21
  • Hi! We are nearly there I think, but there is still an issue. Its confused me for a bit - but the problem actually appears to be that ANY group name that exists in capital letters is being deleted, regardless of whether its a duplicate or not. – Andy A Dec 13 '12 at 09:57
  • @AndyA: find the modified version in consideration of the address book – Julien May Dec 13 '12 at 10:11
  • Fantastic. That looks like its perfect :). Big Green Tick and +1 for you! – Andy A Dec 13 '12 at 10:32