I have a task to write SQL to sanitise and de-duplicate an imported data-set.

 pk1 | pk2 | timestamp | value1 | value2 | value3 | etc
-----+-----+-----------+--------+--------+--------+----- 

  1  |  2  |    123    |   1    |   2    |   5    |  ...
  1  |  2  |    124    |   1    |   2    |   4    |  ...
  1  |  2  |    125    |   1    |   2    |   3    |  ...   Either this row
  1  |  2  |    125    |   1    |   2    |   2    |  ...   Or this row (arbitrary)

  3  |  2  |    123    |   1    |   2    |   5    |  ...
  3  |  2  |    123    |   1    |   2    |   4    |  ...
  3  |  2  |    124    |   1    |   2    |   3    |  ...
  3  |  2  |    125    |   1    |   2    |   2    |  ...   Only this row

The two pk fields are the composite primary key.

The timestamp field identifies when the data was generated.

I need one row per (pk1, pk2), with the highest timestamp taking priority. There can still be duplicates at that timestamp (1, 2, 125 appears twice in the set above), at which point an arbitrary row should be selected and a field set to signify that this was an arbitrary selection.

I have answers for MySQL and for RDBMSs that support ANALYTICAL_FUNCTIONS()...


MySQL :

SELECT
  import.*,
  CASE WHEN COUNT(*) = 1 THEN 0 ELSE 1 END   AS duplicate_warning
FROM
  import
INNER JOIN
(
  SELECT pk1, pk2, MAX(timestamp) AS timestamp
    FROM import
GROUP BY pk1, pk2
)
  AS import_lookup
    ON  import.pk1       = import_lookup.pk1
    AND import.pk2       = import_lookup.pk2
    AND import.timestamp = import_lookup.timestamp
GROUP BY
  import.pk1,
  import.pk2

ANALYTICAL_FUNCTIONS() :

SELECT
  sorted_import.*
FROM
(
  SELECT
    import.*,
    CASE WHEN
      COUNT(*)       OVER (PARTITION BY pk1, pk2, timestamp) = 1
      AND
      MAX(timestamp) OVER (PARTITION BY pk1, pk2)            = timestamp
    THEN
      0
    ELSE
      ROW_NUMBER() OVER (PARTITION BY pk1, pk2 ORDER BY timestamp DESC)
    END  AS duplicate_warning
  FROM
    import
)
  AS sorted_import
WHERE
  sorted_import.duplicate_warning IN (0, 1)


How can this be accomplished using SQLite?

One restriction (I don't make these rules) : Can't use temp tables or auto-increment fields.

MatBailie

1 Answer

In SQLite 3.7.11 or later, when a query contains a single MIN() or MAX(), the values of the non-aggregated columns are guaranteed to come from a row that holds that minimum or maximum:

SELECT *, MAX(timestamp)
FROM import
GROUP BY pk1, pk2
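
The query above picks the latest row per key but does not set the "arbitrary selection" flag the question asks for. One possible way to add it is sketched below, driven from Python's standard sqlite3 module so it can be run as-is. The in-memory table, the sample rows (copied from the question, with the elided "etc" columns left out) and the alias `latest` are assumptions for illustration only. The flagged query joins to the per-key maximum and counts ties, so it does not depend on the 3.7.11 bare-column guarantee; which of the tied rows SQLite returns is arbitrary.

```python
import sqlite3

# Illustration only: an in-memory database stands in for the real import table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE import (pk1 INTEGER, pk2 INTEGER, timestamp INTEGER,"
    " value1 INTEGER, value2 INTEGER, value3 INTEGER)"
)
conn.executemany(
    "INSERT INTO import VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 2, 123, 1, 2, 5),
        (1, 2, 124, 1, 2, 4),
        (1, 2, 125, 1, 2, 3),  # tied with the next row at the latest timestamp
        (1, 2, 125, 1, 2, 2),
        (3, 2, 123, 1, 2, 5),
        (3, 2, 123, 1, 2, 4),
        (3, 2, 124, 1, 2, 3),
        (3, 2, 125, 1, 2, 2),  # unique latest row
    ],
)

# The answer's query: bare columns come from the row holding MAX(timestamp).
latest_rows = conn.execute(
    "SELECT *, MAX(timestamp) FROM import GROUP BY pk1, pk2 ORDER BY pk1, pk2"
).fetchall()

# Flagged variant: keep only rows at the per-key maximum, then collapse each
# key to one (arbitrary) row and report whether more than one row was tied.
FLAGGED_QUERY = """
SELECT
  import.*,
  COUNT(*) > 1 AS duplicate_warning
FROM
  import
INNER JOIN
(
  SELECT pk1, pk2, MAX(timestamp) AS timestamp
    FROM import
GROUP BY pk1, pk2
)
  AS latest
    ON  import.pk1       = latest.pk1
    AND import.pk2       = latest.pk2
    AND import.timestamp = latest.timestamp
GROUP BY
  import.pk1,
  import.pk2
ORDER BY
  import.pk1,
  import.pk2
"""
flagged_rows = conn.execute(FLAGGED_QUERY).fetchall()

for row in flagged_rows:
    print(row)
```

Each result tuple ends with duplicate_warning: 1 for the (1, 2) key, where two rows share timestamp 125, and 0 for the (3, 2) key.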
CL.
  • Hmm, that should mean that the MySQL example is compatible with SQLite. I'll check the version, thanks for confirming this. – MatBailie Jun 25 '14 at 16:59
  • Does MySQL guarantee this, or does it just take the values from the last row in the group? – CL. Jun 25 '14 at 17:00
  • Not sure, will investigate, but in my example above I use a sub-query to isolate the max-timestamp. I just need to *then* pick an arbitrary row from that group. Your description for SQLite suggests that SQLite doesn't need the sub-query, but having that sub-query there does cover my arse in the case of arbitrary MySQL behaviour ;) – MatBailie Jun 25 '14 at 17:07
  • I'll also have to investigate what happens in SQLite for `SELECT table.*, MIN(x), MAX(x) FROM table GROUP BY table.pk`... – MatBailie Jun 25 '14 at 17:11