10

What would be the best way to remove duplicates while merging their records into one?

I have a situation where the table keeps track of player names and their records like this:

stats
-------------------------------
nick     totalgames     wins   ...
John     100            40
john     200            97
Whistle  50             47
wHiStLe  75             72
...

I would need to merge the rows where nick is duplicated (when ignoring case) and merge the records into one, like this:

    stats
    -------------------------------
    nick     totalgames     wins   ...
    john     300            137
    whistle  125            119
    ...

I'm doing this in Postgres. What would be the best way to do this?

I know that I can get the names where duplicates exist by doing this:

select lower(nick) as nick, totalgames, count(*) 
from stats 
group by lower(nick), totalgames
having count(*) > 1;

I thought of something like this:

update stats
set totalgames = totalgames + s.totalgames
from (that query up there) s
where lower(nick) = s.nick

Except this doesn't work properly. And I still can't seem to be able to delete the other duplicate rows containing the duplicate names. What can I do? Any suggestions?

Joe
  • 1,378
  • 5
  • 20
  • 32
  • 1
    Does your stats table have a unique, numeric id field? – MothOnMars Aug 17 '13 at 05:09
  • No, it does not. The primary key is the nick column. I could make id's but then I'd have to change the software too... (I didn't make these to begin with, I joined later) so I think that might be out of option. – Joe Aug 17 '13 at 05:19
  • No problem - that just changes how I'd do the deletion. Answer forthcoming. – MothOnMars Aug 17 '13 at 05:20

4 Answers4

10

SQL Fiddle

Here is your update:

 UPDATE stats
 SET totalgames = x.games, wins = x.wins
 FROM (SELECT LOWER(nick) AS nick, SUM(totalgames) AS games, SUM(wins) AS wins
     FROM stats
      GROUP BY LOWER(nick) ) AS x
 WHERE LOWER(stats.nick) = x.nick;

Here is the delete to blow away the duplicate rows:

 DELETE FROM stats USING stats s2
 WHERE lower(stats.nick) = lower(s2.nick) AND stats.nick < s2.nick;

(Note that the 'update...from' and 'delete...using' syntax are Postgres-specific, and were stolen shamelessly from this answer and this answer.)

You'll probably also want to run this to downcase all the names:

 UPDATE STATS SET nick = lower(nick);

Aaaand throw in a unique index on the lowercase version of 'nick' (or add a constraint to that column to disallow non-lowercase values):

CREATE UNIQUE INDEX ON stats (LOWER(nick)); 
Community
  • 1
  • 1
MothOnMars
  • 2,229
  • 3
  • 20
  • 23
  • Wut. That update is almost identical to what I had already! I was so close :( Hmm, the thing is I already have a primary key on column nick - do you think getting rid of that and creating the unique index is necessary? – Joe Aug 17 '13 at 05:39
  • Well, it depends on how you want to handle those values. If you want to disallow non-lowercase values completly, you could add a constraint on that column ("alter table stats add CONSTRAINT downcase_nick CHECK ((nick)::text = LOWER((nick)::text))"). (That's what I would do.) If you want to allow mixed-case values as long as the downcased versions are unique to avoid the situation you're in, you could just update the existing index to use the lowercase value. – MothOnMars Aug 17 '13 at 05:53
  • Yeah, I suppose. I just decided to force the software to lowercase everything before it interacts with the DB. Thanks for the help. – Joe Aug 17 '13 at 05:55
  • 1
    You may still want to add the constraint I posted above. Your code may be doing the right thing, but it never hurts to have the extra check at the database level. I always like to protect against some other future developer changing or adding an input method that would allow dirty data. – MothOnMars Aug 17 '13 at 06:08
4

I think easiest way to do it in one query would be using common table expressions:

with cte as (
    delete from stats
    where lower(nick) in (
      select lower(nick) from stats group by lower(nick) having count(*) > 1
    )
    returning *
)
insert into stats(nick, totalgames, wins)
select lower(nick), sum(totalgames), sum(wins)
from cte
group by lower(nick);

As you see, inside the cte I'm deleting duplicates and returning deleted rows, after that inserting grouped deleted data back into table.

see sql fiddle demo

Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
3

It can all be done in one statement, using RETURNING.

-- The data
CREATE TABLE stats
        ( nick VARCHAR PRIMARY KEY
        , totalgames INTEGER NOT NULL DEFAULT 0
        , wins INTEGER NOT NULL DEFAULT 0
        );

INSERT INTO stats(nick, totalgames,wins) VALUES
 ( 'John', 100, 40) ,( 'john', 200, 97)
,( 'Whistle', 50, 47) ,( 'wHiStLe', 75, 72)
, ( 'Single', 42, 13 ) -- this person has only one record
        ;
SELECT * FROM stats;

-- The query:
WITH upd AS (
        UPDATE stats dst
        SET totalgames = src.totalgames
                , wins = src.wins
        FROM ( SELECT MIN(nick) AS nick -- pick the "lowest" nick as the canonical nick
                , SUM(totalgames) AS totalgames
                , SUM(wins) AS wins
                FROM stats
                GROUP BY lower(nick)
                ) src
        WHERE dst.nick = src.nick
        RETURNING dst.nick -- only the records that have been updated
        )
-- Delete the records that were NOT updated.
DELETE FROM stats del
WHERE NOT EXISTS (
        SELECT * FROM upd
        WHERE upd.nick = del.nick
        )
        ;

SELECT * FROM stats;

Output:

INSERT 0 5
  nick   | totalgames | wins 
---------+------------+------
 John    |        100 |   40
 john    |        200 |   97
 Whistle |         50 |   47
 wHiStLe |         75 |   72
 Single  |         42 |   13
(5 rows)

DELETE 2
  nick   | totalgames | wins 
---------+------------+------
 wHiStLe |        125 |  119
 john    |        300 |  137
 Single  |         42 |   13
(3 rows)
wildplasser
  • 43,142
  • 8
  • 66
  • 109
0

UPDATE stats SET totalgames=s.totalgames, wins=s.wins
FROM (SELECT lower(nick) AS nick,SUM(totalgames) AS totalgames,SUM(wins) AS wins FROM stats GROUP BY lower(nick))s WHERE lower(nick)=s.nick;
DELETE FROM stats WHERE
lower(nick) IN (SELECT lower(nick) FROM stats GROUP BY lower(nick) HAVING COUNT(*)>1)
AND NOT lower(nick) IN (SELECT first(nick) FROM stats GROUP BY lower(nick) should work.

rimrul
  • 141
  • 7