I have a task to write SQL to sanitise and de-duplicate an imported data-set.
pk1 | pk2 | timestamp | value1 | value2 | value3 | etc
-----+-----+-----------+--------+--------+--------+-----
1 | 2 | 123 | 1 | 2 | 5 | ...
1 | 2 | 124 | 1 | 2 | 4 | ...
1 | 2 | 125 | 1 | 2 | 3 | ... Either this row
1 | 2 | 125 | 1 | 2 | 2 | ... Or this row (arbitrary)
3 | 2 | 123 | 1 | 2 | 5 | ...
3 | 2 | 123 | 1 | 2 | 4 | ...
3 | 2 | 124 | 1 | 2 | 3 | ...
3 | 2 | 125 | 1 | 2 | 2 | ... Only this row
The two pk
fields are the composite primary key.
The timestamp
field identifies when the data was generated.
I need one row per pk1, pk2
, with highest timestamp
taking priority. There can still be duplicates (1, 2, 125
appears twice in the set above), at which point an arbitrary row should be selected and a field set so signify that this was an arbitrary selection.
I have answers for MySQL and RDBMS that support ANALYTICAL_FUNCTIONS()...
MySQL :
SELECT
import.*,
CASE WHEN COUNT(*) = 1 THEN 0 ELSE 1 END AS AS duplicate_warning
FROM
import
INNER JOIN
(
SELECT pk1, pk2, MAX(timestamp) AS timestamp
FROM import
GROUP BY pk1, pk2
)
AS import_lookup
ON import_lookup.pk1 = import_lookup.pk1
AND import_lookup.pk2 = import_lookup.pk2
AND import_lookup.timestamp = import_lookup.timestamp
GROUP BY
import.pk1,
import.pk2
ANALYTICAL_FUNCTIONS() :
SELECT
sorted_import.*
FROM
(
SELECT
import.*,
CASE WHEN
COUNT(*) OVER (PARTITION BY pk1, pk2, timestamp) = 1
AND
MAX(timestamp) OVER (PARTITION BY pk1, pk2) = timestamp
THEN
0
ELSE
ROW_NUMBER() OVER (PARTITION BY pk1, pk2 ORDER BY timestamp DESC)
END AS duplicate_warning
FROM
import
)
AS sorted_import
WHERE
sorted_import.duplicate_warning IN (0, 1)
How can this be accomplished using SQLite?
One restriction (I don't make these rules) : Can't use temp tables or auto-increment fields.