
I have a table called Aircraft and there are many records. The problem is that some are duplicates. I know how to select the duplicates and their counts:

SELECT flight_id, latitude, longitude, altitude, call_sign, measurement_time, COUNT(*)
FROM Aircraft
GROUP BY flight_id, latitude, longitude, altitude, call_sign, measurement_time
HAVING COUNT(*) > 1;

This returns something like:

(screenshot of the query result: the duplicate rows with their counts)

Now, what I need to do is remove the duplicates, leaving just one each so that when I run the query again, all counts become 1.

I know that I can use the DELETE keyword, but I'm not sure how to turn that SELECT into a DELETE.

I'm sure I am missing an easy step, but being a newbie, I do not want to ruin my DB.

How do I do this?

pookie

3 Answers

SELECT
    flight_id, latitude, longitude, altitude, call_sign, measurement_time
FROM Aircraft a
WHERE EXISTS (
    SELECT * FROM Aircraft x
    WHERE x.flight_id = a.flight_id
    AND x.latitude = a.latitude 
    AND x.longitude = a.longitude
    AND x.altitude = a.altitude
    AND x.call_sign  = a.call_sign
    AND x.measurement_time = a.measurement_time 
    AND x.id < a.id
 )
;

If the query above returns the correct rows (the ones to be deleted), you can change it into a DELETE statement:


DELETE
FROM Aircraft a
WHERE EXISTS (
    SELECT * FROM Aircraft x
    WHERE x.flight_id = a.flight_id
    AND x.latitude = a.latitude 
    AND x.longitude = a.longitude
    AND x.altitude = a.altitude
    AND x.call_sign  = a.call_sign
    AND x.measurement_time = a.measurement_time 
    AND x.id < a.id
 )
;
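
Since the goal is not to ruin the DB: a cautious way to apply this, sketched here on the assumption that id is the primary key used above, is to run the DELETE inside a transaction, re-run the duplicate count, and only COMMIT once it comes back empty.

BEGIN;

DELETE
FROM Aircraft a
WHERE EXISTS (
    SELECT * FROM Aircraft x
    WHERE x.flight_id = a.flight_id
    AND x.latitude = a.latitude
    AND x.longitude = a.longitude
    AND x.altitude = a.altitude
    AND x.call_sign = a.call_sign
    AND x.measurement_time = a.measurement_time
    AND x.id < a.id
);

-- the duplicate check from the question should now return no rows
SELECT flight_id, latitude, longitude, altitude, call_sign, measurement_time, COUNT(*)
FROM Aircraft
GROUP BY flight_id, latitude, longitude, altitude, call_sign, measurement_time
HAVING COUNT(*) > 1;

COMMIT;  -- or ROLLBACK; if the result is not what you expect
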
wildplasser
  • So this works, as does systemjack's answer. However, your answer deletes 79 fewer rows. I don't understand why that would be. – pookie Mar 16 '17 at 19:57
  • He assumed {flight_id, measurement_time} would make a good candidate key. But probably there were a few ties. – wildplasser Mar 16 '17 at 20:03

If it's a one-time operation you can create a temp table with the same schema and then copy unique rows over like so:

insert into Aircraft_temp
select distinct on (flight_id, measurement_time) Aircraft.* from Aircraft

Then swap them out by renaming, or truncate Aircraft and copy the temp contents back (truncate Aircraft; insert into Aircraft select * from Aircraft_temp;).

Safer to rename Aircraft to Aircraft_old and Aircraft_temp to Aircraft so you keep your original data until you are sure things are correct. Or at least check that the number of rows from your count query above matches the count of rows in the temp table before doing the truncate.
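
Putting those steps together, the whole swap might look like the sketch below (this assumes PostgreSQL and the id primary key; the ORDER BY decides which duplicate survives):

create table Aircraft_temp (like Aircraft including all);

insert into Aircraft_temp
select distinct on (flight_id, measurement_time) Aircraft.*
from Aircraft
order by flight_id, measurement_time, id;  -- keeps the row with the lowest id per pair

-- sanity check before swapping
select (select count(*) from Aircraft) as original_rows,
       (select count(*) from Aircraft_temp) as deduplicated_rows;

alter table Aircraft rename to Aircraft_old;
alter table Aircraft_temp rename to Aircraft;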

Update 2: With a separate valid primary key (assuming it is called id), you can do a DELETE based on a self-join like this:

delete from Aircraft using (
    select a1.id
    from Aircraft a1
    left join (select flight_id, measurement_time, min(id) as id from Aircraft group by 1,2) a2
    on a1.id = a2.id
    where a2.id is null
) as d
where Aircraft.id = d.id;

This finds the minimum id (you could use max instead to keep the "latest") for each (flight_id, measurement_time) pair and identifies all the records in the full set whose id is not that minimum (no match in the join). The unmatched ids are deleted.
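
If you would rather keep the newest row of each pair, a variant of the same statement (a sketch, assuming newer rows get larger id values) simply swaps min for max:

delete from Aircraft using (
    select a1.id
    from Aircraft a1
    left join (select flight_id, measurement_time, max(id) as id from Aircraft group by 1,2) a2
    on a1.id = a2.id
    where a2.id is null
) as d
where Aircraft.id = d.id;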

systemjack
  • Yeah, I have a primary key `id`. It wasn't included in the SQL dump, sorry. Also, duplicates are fine, it's normal here and I expect there to be duplicates. I want to essentially perform a clean-up of the table. – pookie Mar 16 '17 at 19:12
  • You could add a UNIQUE constraint to the flight_id field then to make sure it can't happen in the future (a sketch follows these comments): http://stackoverflow.com/questions/469471/how-do-i-alter-a-postgresql-table-and-make-a-column-unique – systemjack Mar 16 '17 at 19:15
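
That might look like the sketch below; the constraint name is illustrative, and a composite constraint on (flight_id, measurement_time) likely fits this table better than flight_id alone, since one flight has many measurements:

-- will fail if duplicates still exist, so run the cleanup first
alter table Aircraft
add constraint aircraft_flight_measurement_key unique (flight_id, measurement_time);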

I have always used the CTE method in SQL Server. It lets you define the columns that you want to compare; once you have established which columns make up a duplicate, you can assign a row number to each row and then go back and clean up the rows whose row number is greater than 1. This is an example of the duplicate checking that I do.

WITH CTE AS
(select  d.UID
    ,d.LotKey
    ,d.SerialNo
    ,d.HotWeight
    ,d.MarketValue
    ,RN = ROW_NUMBER() OVER (PARTITION BY d.HotWeight, d.SerialNo, d.MarketValue ORDER BY d.SerialNo)
from LotDetail d
where d.LotKey = ('1~20161019~305')
)
DELETE FROM CTE WHERE RN <> 1

In my example I am looking at the LotDetail table where d.HotWeight and d.SerialNo match. If there is a match, the original gets row number 1 and any duplicates get 2 or greater, depending on the number of duplicates. Then you use the final DELETE statement to clear the entries that come up as duplicates. This is really flexible, so you should be able to adapt it to your issue.

Here is an example tailored to your situation.

WITH CTE AS
(select  d.Flight_ID
    ,d.Latitude
    ,d.Longitude
    ,d.Altitude
    ,d.Call_sign
    ,d.Measurement_time
    ,RN = ROW_NUMBER() OVER (PARTITION BY d.Flight_ID, d.Latitude, d.Longitude, d.Altitude, d.Call_sign, d.Measurement_time ORDER BY d.Measurement_time)
from Aircraft d
where d.flight_id = ('**INSERT VALUE HERE')
)
DELETE FROM CTE WHERE RN <> 1
Joe Resler
  • Hmm, thanks, but I don't know the flight IDs. There are almost a million rows, so it is not practical to insert the flight ID each time. Is that what you intended, or did I miss something? – pookie Mar 16 '17 at 19:41
  • Ah, I missed the Postgres part. Yeah, this won't be an option; however, to answer your question @Pookie, you don't have to know the flight_id, that was just an example of what you could key on. You don't actually have to key on anything, but if you had these with date stamps and wanted to do a range of dates, then you could use that instead (a PostgreSQL-friendly sketch follows these comments). – Joe Resler Mar 16 '17 at 21:03
  • @JoeResler Okay, thanks for the help and the clarification :) – pookie Mar 16 '17 at 21:59
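
For completeness, the same ROW_NUMBER idea can be written in a form PostgreSQL accepts (a sketch, not from the answer above, assuming the id primary key mentioned in the comments):

DELETE FROM Aircraft
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY flight_id, latitude, longitude, altitude, call_sign, measurement_time
                   ORDER BY id
               ) AS rn
        FROM Aircraft
    ) ranked
    WHERE rn > 1
);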