2

I am facing an issue related to a project of mine. Here is the summary of what i would like to do :

I have a big daily file (100 Go) with the following extract (no header) :

ID_A|segment_1
ID_A|segment_2
ID_B|segment_2
ID_B|segment_3
ID_B|segment_4
ID_B|segment_5
ID_C|segment_1
ID_D|segment_2
ID_D|segment_4

Every ID (from A to D) can be linked to one or multiple segments (from 1 to 5).

I would like to process this file in order to have the following result (the result file contains a header) :

ID|segment_1|segment_2|segment_3|segment_4|segment_5
ID_A|1|1|0|0|0
ID_B|0|1|1|1|1
ID_C|1|0|0|0|0
ID_D|0|1|0|1|0

1 means that the ID is included in the segment, 0 means that it is not.

I am using the following query to get the result :

select id,
       countif(segment = 'segment_1') as segment_1,
       countif(segment = 'segment_2') as segment_2,
       countif(segment = 'segment_3') as segment_3,
       countif(segment = 'segment_4') as segment_4,
       countif(segment = 'segment_5') as segment_5
from staging s cross join
     unnest(split(segments, ',')) as segment
group by id;

This solution worked for me until the number of segments became a lot higher (900+ segments instead of 5 in my first example). This is creating a huge query that cannot be passed as an argument through bq cli.

Is there any workaround that i can use ?

Thanks to all for your help.

Regards

muslash
  • 39
  • 6

1 Answers1

1

Below is for BigQuery Standard SQL

EXECUTE IMMEDIATE '''
SELECT id, ''' || (
  SELECT STRING_AGG("COUNTIF(segment = '" || segment || "') AS " || segment ORDER BY segment)
  FROM (SELECT DISTINCT segment FROM staging)  
) || '''  
FROM staging
GROUP BY 1
ORDER BY 1
'''     

If to apply to sample data in your question - output is

Row id      segment_1   segment_2   segment_3   segment_4   segment_5    
1   ID_A    1           1           0           0           0    
2   ID_B    0           1           1           1           1    
3   ID_C    1           0           0           0           0    
4   ID_D    0           1           0           1           0  

and as you can see you don't need to worry about number and naming of segments - it is taken care of by above query

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Hello Mikhail, this is exactly what i was looking for. Thank you! However, when using this script with `bq` cli, i have got this error (i would like to save the result into a new table) : `BigQuery error in query operation: Error processing job 'project:bqjob_r323e9fd198fe602c_0000017390d81787_1': configuration.query.destinationTable cannot be set for scripts` I think that i need to manage that in the script. Is there any way to do that ? Thank you again – muslash Jul 27 '20 at 15:29
  • 1
    see [INSERT statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#insert_statement) - if still problem with it - post new question and we will be happy to help further – Mikhail Berlyant Jul 27 '20 at 15:43