0

I need to delete duplicate lines, leaving only the most updated ones

e.g.: SCHEMA: user_id, conversation_id, created_at, updated_at

DATA SAMPLE:

Row user_id conversation_id          created_at                  updated_at
1    234        45664        2020-06-05 16:17:01 UTC    2020-06-05 23:27:01 UTC        
2    234        45664        2020-06-05 16:17:01 UTC    2020-06-06 14:34:01 UTC
3    232        45665        2020-06-05 17:23:41 UTC    2020-06-05 18:12:01 UTC
4    232        45665        2020-06-05 17:23:41 UTC    2020-06-07 12:47:01 UTC

I want to keep only the most updated row:

   Row user_id conversation_id          created_at                  updated_at

    1    234        45664        2020-06-05 16:17:01 UTC    2020-06-06 14:34:01 UTC
    2    232        45665        2020-06-05 17:23:41 UTC    2020-06-07 12:47:01 UTC

I tried:

DELETE * FROM my_table
WHERE conversation_id IN (SELECT conversation_id FROM my_table
HAVING count(conversation_id) = 2
GROUP BY 1)
AND min(updated_at)
Lucas Pressi
  • 21
  • 1
  • 5

1 Answers1

0

Check https://stackoverflow.com/a/45311051/132438:

CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
SELECT k.*
FROM (
  SELECT ARRAY_AGG(row ORDER BY updated_at DESC LIMIT 1)[OFFSET(0)] k 
  FROM `deleting.deduplicating_table` row
  GROUP BY user_id, conversation_id
)
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325