
I have a BigQuery table with columns col1, col2, col3. I want to delete rows that have duplicate values in col1 but keep the last one (the row that was most recently pushed to BigQuery). col2 and col3 do not have to be duplicates.

My main problem is that I cannot find the last row. I tried the query below, but the ordering was not right. Yet when I just SELECT all the rows, the ordering is from the oldest to the newest rows.

SELECT *, ROW_NUMBER() OVER (PARTITION BY col1) AS row_numb
FROM table

I saw other solutions, but they ordered by some timestamp/created_at column that I do not have. I know that one solution would be to add a timestamp column and then order by it to get the most recent row. But is there any other way?

Example:

col1 col2 col3
1 2 3
2 3 1
1 4 3

The last row (col1 = 1, col2 = 4, col3 = 3) was added to BigQuery last. So what I want is to find the duplicates in the first column (that would be the 1st and 3rd rows) and delete all of them except the one that was added to BigQuery last (that is the 3rd row).

The result would be:

col1 col2 col3
2 3 1
1 4 3
Klemen Vrhovec
  • "But is there any other way?" - No! There is no other way! - you just need to have rule of how to order rows to identify which one is last! – Mikhail Berlyant Oct 15 '21 at 17:27
  • Ok then I will just add another column with timestamp. I wanted to avoid additional columns because of size of the tables (mainly because of cost), since my tables will get quite large. I will leave the question open for now to see if anyone else will have any ideas – Klemen Vrhovec Oct 15 '21 at 17:32
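Following the comment thread, a minimal sketch of that approach could look like the query below. It assumes a hypothetical ingestion_ts TIMESTAMP column has been added and is populated at load time (e.g. with CURRENT_TIMESTAMP()); the column name and the dataset.table path are placeholders, not part of the original question.

```sql
-- Sketch only: keeps, for each col1, the row with the latest hypothetical
-- ingestion_ts. BigQuery's QUALIFY clause filters directly on the window
-- function; it must be accompanied by a WHERE, GROUP BY, or HAVING clause,
-- hence the WHERE TRUE.
CREATE OR REPLACE TABLE dataset.table AS
SELECT col1, col2, col3, ingestion_ts
FROM dataset.table
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY ingestion_ts DESC) = 1;
```

Note that without such a column there is no reliable ordering to lean on: BigQuery does not guarantee that an unordered SELECT returns rows in insertion order, even if it appears to in practice.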

0 Answers