
I have a database in which I have managed to get some duplicate rows.

The table contains the following columns:

--------------------------------------------------------------
| id  | did    | sensorid | timestamp | data  | db_timestamp |
--------------------------------------------------------------
| int | string | int      | bigint    | jsonb | bigint       |
--------------------------------------------------------------

The timestamp and db_timestamp columns are Unix timestamps (milliseconds since January 1, 1970, 00:00:00 UTC).

I have managed to get a lot of duplicate values and I need a fast method for "discarding" them. The columns that identify a duplicate are did, sensorid and timestamp: if two rows have the same values in all three of these columns, they are duplicates.

I have made the following query to find the duplicates; now I just need to find out how to delete them.

SELECT did, sensorid, timestamp, COUNT(*)
FROM <db_name>
GROUP BY did, sensorid, timestamp
HAVING COUNT(*) > 1
ORDER BY sensorid, timestamp;
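
For reference, the same grouping can also list the row ids that make up each duplicate group (this is just a sketch using the same placeholder table name, so adjust it to your schema):

SELECT did, sensorid, timestamp, COUNT(*), array_agg(id ORDER BY id) AS ids
FROM <db_name>
GROUP BY did, sensorid, timestamp
HAVING COUNT(*) > 1
ORDER BY sensorid, timestamp;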

My db contains almost 1,200,000,000 rows, and I now have far too many of these duplicates. I wrote a Python script to delete them for me, but it turned out not to run as fast as I need it to, so I hope that an SQL query for deleting the duplicates will do the trick.

Here is a print out of the output from my above query:

----------------------------------------------------
|       did       | sensorid |  timestamp  | count |
|"358188056470108"|    910   |1492487704000|  61   |
|"358188056470108"|    910   |1492487705000|  61   |
----------------------------------------------------

I know that did could be stored as a bigint or similar to improve performance, but there is a reason for not doing so.

I want to delete all of the duplicates; however, it is important that I keep one row from each group, ideally the row with the lowest id, which will then be the "original" row.

I hope that someone here can help me create such a query.

Zeliax

1 Answer


Use a CTE with row_number() to identify the ids to delete, then delete them:

with CTE as
(
    -- rn = 1 marks the row with the lowest id in each (did, sensorid, timestamp) group
    select t1.id,
           row_number() over (partition by did, sensorid, timestamp order by t1.id) as rn
    from MyTable t1
)
delete
from MyTable
where id in (select id from CTE where rn > 1);
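
If the CTE turns out to be slow at this scale, a plain self-join delete is another option you could try; this is only a sketch against the same placeholder table name (MyTable) and also keeps the lowest id per (did, sensorid, timestamp) group:

-- delete every row for which an older row (lower id) with the same key exists
delete
from MyTable a
using MyTable b
where a.did = b.did
  and a.sensorid = b.sensorid
  and a.timestamp = b.timestamp
  and a.id > b.id;

Either way, an index on (did, sensorid, timestamp) should speed up both finding and deleting the duplicates considerably.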
JohnHC
  • It appears OP does not have a unique `id` column which can be used to tell the duplicates apart (or rank them). (on second view: he does seem to have it, but he does not list it ...) – wildplasser Aug 30 '17 at 10:59
  • Each row has a unique id. I guess I forgot to mention that. The id column is a serial (auto-incrementing integer). – Zeliax Aug 30 '17 at 11:29