Delete duplicate values in BigQuery without recreating a table

Question

I have a table in BigQuery that has >1M rows. I got a duplicate with the same ID and the same data completely. Need to delete one of the two duplicates without recreating the table or changing anything on that table. Please help, thanks.

I did try to just delete the id but it will delete the both rows. Need to try with row_number but not sure how.

*'without [...] changing the table'* – not temporarily either? If you add and remove in a single transaction no one should notice. — Aconcagua, Jul 26 '23 at 06:26
Initially thought about adding a temporary column, filling with unique ids (with some BigQuery equivalent to [here](https://stackoverflow.com/questions/9414826/sql-server-how-to-add-new-identity-column-and-populate-column-with-ids)) and delete those with (new) id not in the maxima of ids when grouping all the other columns. Though `select`ing `distinct` in a temporary table, dropping the old one and renaming the new one back might actually be faster, at very least much simpler to implement ;) — Aconcagua, Jul 26 '23 at 07:19
And once you've solved define a primary key or at least some unique constraint such that duplicates cannot re-enter any more. — Aconcagua, Jul 26 '23 at 07:21

score 0 · Answer 1 · answered Jul 26 '23 at 06:29

0

Consider below option (BigQuery Standard SQL)

CREATE OR REPLACE TABLE your_table AS 
SELECT *
FROM your_table t
QUALIFY ROW_NUMBER() OVER(PARTITION BY FORMAT('%t', t)) = 1;

answered Jul 26 '23 at 06:29

Mikhail Berlyant

165,386
8
154
230

How would I do this without creating or replacing the table? – Insecupa Jul 26 '23 at 06:57
i think it is not possible via delete – Mikhail Berlyant Jul 26 '23 at 15:54

ValNik · Answer 2 · 2023-07-26T15:35:11.370

Try this (tested for SQL Server. I don't know, is this work with BigQuery. Not works with Postgres. Only as example.)

--test data
create table test (id int,dt date,descr varchar(10));
insert into test values
 (1,'2023-07-26','test1-1')
,(1,'2023-07-26','test1-2')
,(1,'2023-07-26','test1-3')
,(2,'2023-07-26','test2-1')
,(3,'2023-07-26','test3-1')
;
-- query
with rng as (
     select * 
        ,row_number()over(partition by id,dt order by (select 1)) rn
     from test
)
delete from rng
where rn>1;

result

id	dt	descr
1	2023-07-26	test1-1
2	2023-07-26	test2-1
3	2023-07-26	test3-1

Delete duplicate values in BigQuery without recreating a table

2 Answers2