
I have a table with more than 70M rows of data, about 2M of which are duplicates. I want to clean up the duplicates, keeping only the most recent row from each duplicate group.

I found a few solutions here - link

However, those solutions only delete the duplicates; they don't keep the most recent row among the duplicates.

Here is another common solution:

;WITH cte AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY updatedAt DESC, status DESC
           ) AS RN
    FROM MainTable
)
DELETE FROM cte
WHERE RN > 1

But it is not supported in BigQuery.
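The keep-the-newest semantics of that CTE can be reproduced on other engines that support window functions but, like BigQuery, don't allow `DELETE FROM` a CTE. Below is a minimal sketch using Python's built-in `sqlite3` (requires SQLite 3.25+ for window functions); the table and column names other than `id`/`updatedAt` are made up for illustration:

```python
import sqlite3

# Toy stand-in for MainTable; id/updatedAt mirror the question,
# payload is an invented column so we can see which row survives.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MainTable (id INTEGER, updatedAt TEXT, payload TEXT)")
con.executemany(
    "INSERT INTO MainTable VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", "old"),
        (1, "2020-06-01", "new"),   # duplicate of id 1 - newer, should survive
        (2, "2020-03-01", "only"),  # no duplicate
    ],
)

# SQLite does not allow DELETE FROM a CTE either, so the ROW_NUMBER
# trick is routed through the implicit rowid instead: keep the rowid
# of the newest row per id, delete everything else.
con.execute("""
    DELETE FROM MainTable
    WHERE rowid NOT IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS rn
            FROM MainTable
        )
        WHERE rn = 1
    )
""")

rows = con.execute("SELECT id, payload FROM MainTable ORDER BY id").fetchall()
print(rows)  # only the newest row per id remains
```

This is only a demonstration of the windowing logic, not something that runs on BigQuery, where the table-rewrite approach in the answers below is the practical route.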

Bala.Raj

2 Answers


Here is a workaround: replace the existing table with one row per id, keeping the most recently updated one.

CREATE OR REPLACE TABLE
  `MainTable` AS
SELECT
  id,
  acctId,
  appId,
  createdAt,
  startTime,
  subAcctId,
  type,
  updatedAt,
  userId
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
    -- the first row per id (most recently updated) is kept; the rest are dropped
  FROM
    `MainTable`)
WHERE
  RN = 1

Since there is no option to drop a single column (RN) afterwards, the required columns have to be listed explicitly while replacing the existing table.

Hope this helps someone. Please share if you have any better solutions.
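The rebuild-with-explicit-columns idea above can be sketched outside BigQuery as well. Here is a rough Python `sqlite3` illustration of the same pattern (SQLite has no `CREATE OR REPLACE TABLE`, so the rebuilt table is swapped in by hand; all names besides `id`/`updatedAt` are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MainTable (id INTEGER, updatedAt TEXT, payload TEXT)")
con.executemany(
    "INSERT INTO MainTable VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", "old"),
        (1, "2020-06-01", "new"),
        (2, "2020-03-01", "only"),
    ],
)

# Rebuild the table from a deduplicated SELECT, listing the wanted
# columns explicitly so the helper RN column is left behind, then
# swap the new table into place of the old one.
con.executescript("""
    CREATE TABLE MainTable_dedup AS
    SELECT id, updatedAt, payload
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
        FROM MainTable
    )
    WHERE RN = 1;

    DROP TABLE MainTable;
    ALTER TABLE MainTable_dedup RENAME TO MainTable;
""")

rows = con.execute("SELECT id, payload FROM MainTable ORDER BY id").fetchall()
print(rows)  # one row per id, the most recently updated
```

Note that in BigQuery the `CREATE OR REPLACE TABLE ... AS SELECT` statement does the rebuild and swap in one atomic step, which is safer than the drop-and-rename dance needed here.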

Bala.Raj

Below is for BigQuery Standard SQL

CREATE OR REPLACE TABLE
  `MainTable` AS
SELECT * EXCEPT(RN)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
    -- the first row per id (most recently updated) is kept; the rest are dropped
  FROM
    `MainTable`)
WHERE
  RN = 1
Mikhail Berlyant