Store only 1 values and remove the rest for same duplicated values in bigquery

Question

I have duplicated values in my data. However, from the duplicated values, i only want to store 1 values and remove the rest of same duplicated values. So far, I have found the solution where they remove ALL the duplicated values like this.

posts

Code:

  SELECT ID, a.date as date.A, b.date as date.B, 
       CASE WHEN a.date <> b.date THEN NULL END AS b.date
except(date.A)

FROM
    table1 a LEFT JOIN table2 b
 USING (ID)
WHERE date.A = 1

Sample input:

Sample output (Store only 1 values from the duplicated values and remove the rest):

NOTE: query might wrong as it remove all duplicated values.

in your data sample you have not a primary key .. so how you can try to delete a row and not the other with same value ??' .. update you data sample adding info for define a primary key — ScaisEdge, Jul 23 '20 at 08:57
edited with the real database. isnt the ID is primary key? and it will be other IDs with duplicated rows — swm, Jul 23 '20 at 09:01
looking to your sample .. i see alway the same value .. a primary key by definition must be unique .. so if you a valid primary key is easy delete duplicatedc row otherwise is not posssible .. — ScaisEdge, Jul 23 '20 at 09:08
Is it possible to do like ..remove * where date > 1 for each ID (?) ..sorry not a right query, but my idea is there — swm, Jul 23 '20 at 09:14
Your query is really hard to follow. You have used `except` incorrectly and the `case` expression makes no sense. — Gordon Linoff, Jul 23 '20 at 11:01

Alexandre Moraes · Accepted Answer · 2020-07-23T10:53:44.553

Considering your screenshot's sample data and your explanation. I understand that you want to remove duplicates from your table retaining only one row of unique data. Thus, I was able to create a query to select only one row of data ignoring the duplicates.

In order to select the rows without any duplicates, you can use SELECT DISTINCT. According to the documentation, it discards any duplicate rows. In addition to this method, CREATE TABLE statement will also be used to create a new table (or replace the previous one) with the new data without duplicates. The syntax is as follows:

CREATE OR REPLACE TABLE project_id.dataset.table AS
SELECT DISTINCT ID, a.date as date.A, b.date as date.B, 
       CASE WHEN a.date <> b.date THEN NULL END AS b.date
except(date.A)

FROM
    table1 a LEFT JOIN table2 b
USING (ID)
WHERE date.A = 1

And the output will be exactly the same as you shared in your question.

Notice that I used CREATE OR REPLACE, which means if you set project_id.dataset.table to the same path as the table within your select, it will replace your current table (in case you have the data coming from one unique table). Otherwise, it will create a new table with the specified new table's name.

score 1 · Answer 2 · answered Jul 23 '20 at 11:03

1

You can use aggregation. Something like this:

SELECT ANY_VALUE(a).*, ANY_VALUE(b).*
FROM table1 a LEFT JOIN
     table2 b
     USING (ID)
WHERE date.A = 1
GROUP BY id, a.date;

For each id/datecombination, this returns an arbitrary matching row froma/b`.

answered Jul 23 '20 at 11:03

Gordon Linoff

1,242,037
58
646
786

Store only 1 values and remove the rest for same duplicated values in bigquery

2 Answers2