How do I delete duplicates, and update the records that refer to those duplicates in SQL

Question

I have two tables:

User:(int id, varchar unique username)

Items: (int id, varchar name, int user_id)

currently, there are case insensitive duplicates in user table like:

1,John
2,john
3,sally
4,saLlY

and the Items table will then have

1,myitem,1
2,mynewitem,2
3,my-item,3
4,mynew-item,4

I've updated the code that inserts to user table to make sure it always inserts lowercase.

However, I need to migrate the database so that duplicates are removed from the user table, and the item table reference is updated so the user doesn't lose access to their items

I.E the data after migration will be:

User:

1,john
3,sally

Items

1,myitem,1
2,mynewitem,1
3,my-item,3
4,mynew-item,3

Since the user table has a unique constraint, i can't just set it to lower like

update public.user set username =lower(username)

first update the items so they all point tot he correct version of the user, then delete unwanted users. — Randy, Jan 31 '17 at 15:44
i can do that in Java or other programming language, I am wondering if it's possible to do this purely using SQL — user171943, Jan 31 '17 at 15:50
Add Row numbers partitioned by Username then delete where row number > 1. https://msdn.microsoft.com/en-us/library/ms186734.aspx Then http://stackoverflow.com/questions/2334712/update-from-select-using-sql-server — Snowlockk, Jan 31 '17 at 17:03

score 2 · Answer 1 · answered Jan 31 '17 at 21:19

The following code is tested with "H2 1.3.176 (2014-04-05) / embedded mode" on the web console. There are two queries that should solve the issue as you stated, and there is an additional preparation statement for considering a case that - though not shown in your data - should be considered, too. The preparation statement will be explained a little bit later; Let's start with the main two queries:

First, all items.userids will be rewritten to those of corresponding user entries with lower case name as follows: Let's call lower case entries main and non lower case entries dup. Then, every items.userid, which refers to a dup.id, will be set to a corresponding main.id. A main entry corresponds to a dup entry if a case-insensitive comparison of their names matches, i.e. main.name = lower(dup.name).

Second, all dup entries in the user table will be deleted. A dup entry is one where name <> lower(name).

So far the basic requirements. Additionally, we should consider that for some users there might exist only entries with upper case characters, but no "lower case entry". For dealing with this situation, a preparation statement is used, which sets - for each group of common names - one name out of each group to lowercase.

drop table if exists usr;

CREATE TABLE usr
    (`id` int primary key, `name` varchar(5))
;

INSERT INTO usr
    (`id`, `name`)
VALUES
    (1, 'John'),
    (2, 'john'),
    (3, 'sally'),
    (4, 'saLlY'),
    (5, 'Mary'),
    (6, 'mAry')

;

drop table if exists items;

CREATE TABLE items
    (`id` int, `name` varchar(10), `userid` int references usr (`id`))
;

INSERT INTO items
    (`id`, `name`, `userid`)
VALUES
    (1, 'myitem', 1),
    (2, 'mynewitem', 2),
    (3, 'my-item', 3),
    (4, 'mynew-item', 4)
;

update usr set name = lower(name) where id in (select min(ui.id) as minid from usr ui where lower(ui.name) not in (select ui2.name from usr ui2)
group by lower(name));

update items set userid =
(select umain.id as mainid from usr udupl, usr umain
 where umain.name = lower(umain.name)
     and lower(udupl.name) = lower(umain.name)
     and udupl.id = userid
);

delete from usr where name <> lower(name);

select * from usr;

select * from items;

Executing above statements yields the following results:

select * from usr;
ID  | NAME
----|-----
2   | john
3   | sally
5   | mary

select * from items;
ID | NAME     |USERID  
---|----------|------
1  |myitem    | 2
2  |mynewitem | 2
3  |my-item   | 3
4  |mynew-item| 3

score 2 · Answer 2 · answered Feb 01 '17 at 12:58

If you first update correctly the items references, then you can delete the users duplicates. In the following example I kept the users with the minimum id as the correct ones, if this doesn't bother you

--Prepare data
create TABLE #users  
(id int primary key, username varchar(15));

INSERT INTO #users
(id, username)
select
1, 'John'
union all select
2, 'john'
union all select
3, 'sally'
union all select
4, 'saLlY'
union all select
5, 'Mary'
union all select
6, 'mAry'


create TABLE #items  
(itemid int, name varchar(10), userid int references #users (id));

INSERT INTO #items
(itemid, name, userid)
select
1, 'myitem', 1
union all select
2, 'mynewitem', 2
union all select
3, 'my-item', 3
union all select
4, 'mynew-item', 4
;

--Update items
update #items 
set userid =minid 
from
 (
select minid,id from 
(
select min(id) as minid,lower(username) as newusername
from #users group by username) t inner join #users 
on t.newusername = username) t2 inner join #items on t2.id = userid


--delete duplicates users, according to minimum id
delete from #users where id not in (
select min(id) from #users group by lower(username))

--set the remaining users names to lower
update #users
set username = lower(username)

--Clean temp data
drop table #users
drop table #items

This was tested in sqlserver, but you asked for pure sql, so I think it will suits you

score 1 · Answer 3 · answered Jan 31 '17 at 19:12

Update Items first:

update items
set userid = u.userid
from items i
   inner join users u on i.iserid=u.userid
   inner join (select userid, username, row_number() over (partition by username order by userid)) u2 on u2.username=u.username and rn=1

then create new user table based off original:

select userid, lower(username) username 
into NewUserTable
from (select userid, username, row_number() over (partition by username order by userid)) u 
where rn=1

Not sure if H2 uses window functions. I'll leave this as a SQL Server solution. — KeithL, Jan 31 '17 at 19:18

score 1 · Answer 4 · answered Feb 01 '17 at 13:06

This code works perfect on SQL Server

Try it it will help you (you may need to simple changes to comply with your DB engine):-

SELECT U1.id,U2.id id2
INTO #User_Tmp
FROM User U1 JOIN User U2 
ON LOWER(U2.username) = LOWER(U1.username) 
AND U1.id < U2.id

UPDATE It
SET It.user_id = U.id
FROM Items It
JOIN #User_Tmp U
ON U.id2 = It.id

DELETE FROM User
WHERE id IN 
(
    SELECT id2 FROM #User_Tmp
)

SELECT *
FROM User

SELECT *
FROM Items

DROP TABLE #User_Tmp;

hope this Answers the question.

score 1 · Answer 5 · answered Feb 02 '17 at 13:16

BEGIN TRAN
CREATE TABLe #User (UserID Int, UserName Nvarchar(255))

INSERT INTO #USER
SELECT 1,'John' UNION ALL
SELECT 2,'John'  UNION ALL
SELECT 3,'sally' UNION ALL
SELECT 4,'saLlY'

CREATE TABLE #items  
(itemid int, name varchar(10), userid int );

INSERT INTO #items
(itemid, name, userid)
select
1, 'myitem', 1
union all select
2, 'mynewitem', 2
union all select
3, 'my-item', 3
union all select
4, 'mynew-item', 4

GO
WITH CTE (USERID, DuplicateCount)
AS
(
    SELECT UserName,
    ROW_NUMBER() OVER(PARTITION BY  UserName
    ORDER BY  UserName) AS DuplicateCount
    FROM #User

)
Delete from CTE Where DuplicateCount > 1

Select * from #User

Select * from #items

ROLLBACK TRAN

After removing the Duplicate Record you can simple Updates the table — Alfaiz Ahmed, Feb 02 '17 at 13:17

score 1 · Answer 6 · answered Feb 06 '17 at 12:34

Try out MERGE statement using this you can find out duplicate and also you can update the values of duplicates.

MERGE [INTO] <target table>

USING <source table or table expression>

ON <join/merge predicate> (semantics similar to outer join)

WHEN MATCHED <statement to run when match found in target>

WHEN [TARGET] NOT MATCHED <statement to run when no match found in target>

score 1 · Accepted Answer · answered Feb 07 '17 at 12:32

I'm not good at H2. You can try this writen for SQL Server and database case sensitive, accent sensitive.

create table t_user(id int not null identity(1,1), username varchar(25) unique);
alter table t_user add constraint pk_id_user primary key(id);

create table t_items(id int not null identity(1,1), name varchar(25), user_id int);
alter table t_items add constraint pk_id_items primary key(id);
alter table t_items add constraint fk_user_id foreign key(user_id) references t_user(id);

insert into t_user (username) values ('John'), ('john'), ('sally'), ('saLlY');
insert into t_items (name, user_id) values ('myitem', 1), ('mynewitem', 2), ('my-item', 3), ('mynew-item',4);

select * from t_user
select * from t_items

create table t_user_mig(id int not null identity(1,1), username varchar(25) unique);
alter table t_user_mig add constraint pk_id_user_mig primary key(id);

create table t_items_mig(id int not null identity(1,1), name varchar(25), user_id int);
alter table t_items_mig add constraint pk_id_items_mig primary key(id);
alter table t_items_mig add constraint fk_user_id_mig foreign key(user_id) references t_user_mig(id);

insert into t_user_mig select distinct lower(username) from t_user
insert into t_items_mig
select ti.name, (select id from t_user_mig where username = lower(tu.username)) 
from t_items ti, t_user tu 
where ti.user_id = tu.id

select * from t_user_mig
select * from t_items_mig

I replace your tables user, items by t_user, t_items. These tables are migrated to t_user_mig, t_items_mig.

You can try it in H2. I'll appreciate your feedback.

I hope it can help.

How do I delete duplicates, and update the records that refer to those duplicates in SQL

7 Answers7