deleting duplicate row with no unique identifier

Question

I have some data in a table that looks roughly like the following:

table stockData
(
tickId int not null,
timestamp datetime not null,
price decimal(18,5) not null
)

Neither tickId nor timestamp are unique, however the combination of tickId and timestamp is supposed to be unique.

I have some duplicate data in my table, and I'm attempting to remove it. However, I'm coming to the conclusion that there is not enough information with the given data for me to discern one row from the other, and basically no way for me to delete just one of the duplicate rows. My guess is that I will need to introduce some sort of identity column, which would help me to identify one row from the other.

Is this correct, or is there some magic way of deleting one but not both of the duplicate data with a query?

EDIT Edited to clarify that tickId and timestamp combo should be unique, but it's not because of the duplicate data.

Just an idea, i don't know if it works or if it will delete both: DELETE FROM stackData WHERE tickId IN (SELECT tickId FROM stackData); — , Mar 25 '12 at 18:08
From SQL Server 2005 there is a magic way to delete only one of the duplicates. What version of SQL server do you use? After you fix the duplicate issue you should add a primary key constraint to your table. — Mikael Eriksson, Mar 25 '12 at 18:13
Thanks Mikael, I'm using SQL Server 2005. I did have a unique constraint, however I have been consolidating data from several tables into a single table, and these tables had duplicate data. I figured the easiest way was to just dump everything into a single table, delete the dupes, and then add the constraint after. — steve8918, Mar 25 '12 at 18:15

score 29 · Accepted Answer · answered Mar 25 '12 at 18:14

29

Here is a query that will remove duplicates and leave exactly one copy of each unique row. It will work with SQL Server 2005 or higher:

WITH Dups AS
(
  SELECT tickId, timestamp, price,
    ROW_NUMBER() OVER(PARTITION BY tickid, timestamp ORDER BY (SELECT 0)) AS rn
  FROM stockData
)
DELETE FROM Dups WHERE rn > 1

answered Mar 25 '12 at 18:14

Sergey Rybalkin

3,004
22
26

wow, I've never seen anything like this before, but it looks great. I will read up on this and do some testing, thanks! – steve8918 Mar 25 '12 at 18:26
Nice answer from a first timer :) Welcome to stackoverflow. – Leigh Mar 25 '12 at 23:54

score 4 · Answer 2 · answered Mar 25 '12 at 18:10

4

select distinct * into temp_table from source_table (this table will be created for you)

delete from temp_table (what you don't need)

insert into sorce_table
select * from temp_table

answered Mar 25 '12 at 18:10

Jester

3,069
5
30
44

NuNn DaDdY · Answer 3 · 2012-03-25T18:23:49.923

Maybe I'm not understanding your question correctly, but if "tickId" and "timestamp" are guaranteed to be unique then how do you have duplicate data in your table? Could you provide an example or two of what you mean?

However, if you have duplicates of all three columns inside the table the following script may work. Please test this and make a backup of the database before implementing as I just put it together.

declare @x table 
(
    tickId int not null,
    timestamp datetime not null,
    price decimal(18,5) not null
)

insert into @x (tickId, timestamp, price)
select tickId,
    timestamp,
    price
from stockData
group by tickId,
         timestamp,
         price
having count(*) > 1

union 

select tickId,
       timestamp,
       price
from stockData
group by tickId,
         timestamp,
         price
having count(*) = 1

delete 
from stockData

insert into stockData (tickId, timestamp, price)
select tickId,
       timestamp,
       price
from @x

alter table stockData add constraint
    pk_StockData primary key clustered (tickid, timestamp)

sorry about that, I meant it "should be" unique, however it's not (because of the duplicate data). I'll update the question. — steve8918, Mar 25 '12 at 18:13

deleting duplicate row with no unique identifier

3 Answers3

Linked