I have a database in which I have managed to get some duplicate rows.
The table contains the following columns and types:
--------------------------------------------------------------
| id  | did    | sensorid | timestamp | data  | db_timestamp |
--------------------------------------------------------------
| int | string | int      | bigint    | jsonb | bigint       |
--------------------------------------------------------------
The timestamp and db_timestamp columns are Unix timestamps (milliseconds since January 1, 1970, 00:00:00 UTC).
I have managed to get a lot of duplicate values and I need a fast method for "discarding" them. The columns that identify a duplicate are did, sensorid and timestamp. This means that if I find rows where these three values are the same, then they are duplicates.
I have made the following query to find the duplicates; now I just need to find out how to delete them.
SELECT did, sensorid, timestamp, COUNT(*)
FROM <db_name>
GROUP BY did, sensorid, timestamp
HAVING COUNT(*) > 1
ORDER BY sensorid, timestamp;
My database contains almost 1,200,000,000 rows, and by now I have way too many of these duplicates, so I created a Python script to delete them for me. However, I have found that the Python script doesn't run as fast as I need it to, so I hope that an SQL query to delete my duplicates can do the job instead.
Here is a printout of the output from the query above:
--------------------------------------------------------
| did               | sensorid | timestamp     | count |
--------------------------------------------------------
| "358188056470108" |      910 | 1492487704000 |    61 |
| "358188056470108" |      910 | 1492487705000 |    61 |
--------------------------------------------------------
I know that the did column could be a bigint or similar to improve performance, but there is a reason for not doing so.
I want to delete all of the duplicates, but it is important that I keep one of the rows from each group, ideally the row with the lowest id, which will then be the "original" row.
I hope that someone here can help me create such a query.
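For reference, this is the sort of statement I have been experimenting with (just a sketch, assuming the database really is PostgreSQL, which the jsonb column suggests, and that id is unique for every row; <db_name> again stands in for my actual table name):
-- sketch: keep only the row with the lowest id in each (did, sensorid, timestamp) group
DELETE FROM <db_name> a
USING <db_name> b
WHERE a.did = b.did
  AND a.sensorid = b.sensorid
  AND a.timestamp = b.timestamp
  AND a.id > b.id;
I am not sure whether a self-join like this is feasible on 1.2 billion rows, or whether it would need an index on (did, sensorid, timestamp) to finish in a reasonable time.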