
The original "why" of what I want to do is:

Restore a table while maintaining its original partitioning, instead of all of it going into today's partition.

What I thought I could do is bq load into a temporary table, then run a query to split that table into one table per day, named with the YYYYMMDD convention that bq partition expects (i.e. sharded_YYYYMMDD), and then run bq partition.
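Roughly, the workflow I had in mind is the following sketch (all dataset, table, bucket, and file names are just placeholders):

# Load the backup into a temporary table
bq load --source_format=NEWLINE_DELIMITED_JSON \
  mydataset.temp_table gs://mybucket/tablebackup.json ./schema.json

# ...split temp_table into mydataset.sharded_YYYYMMDD tables here (the part I'm asking about)...

# Fold the sharded tables into a single date-partitioned table
bq partition --time_partitioning_type=DAY \
  mydataset.sharded_ mydataset.restored_partitioned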

This page https://cloud.google.com/bigquery/docs/creating-partitioned-tables gives examples, but they require running a query per day, which could mean hundreds of queries:

bq query --use_legacy_sql=false --allow_large_results --replace \
  --noflatten_results --destination_table 'mydataset.temps$20160101' \
  'SELECT stn,temp from `bigquery-public-data.noaa_gsod.gsod2016` WHERE mo="01" AND da="01" limit 100'

So how do I make a single query that will iterate over all the days and make one table per day?

I found a similar question here, Split a table into multiple tables in BigQuery SQL, but there is no answer about doing it with a single query.


2 Answers


The main problem here is having a full table scan for each and every day. The rest is less of a problem and can easily be scripted in any client of your choice.

So, what follows addresses: how to avoid a full table scan for each and every day?

Try the steps below to see the approach.
It is generic enough to extend/apply to your real case – meanwhile I am using the same example as in your question and limiting the exercise to just 10 days.

Step 1 – Create the pivot table
In this step we a) compress each row's content into a record/array and b) put them all into the respective "daily" column.

#standardSQL
SELECT
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160101' THEN r END) AS day20160101,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160102' THEN r END) AS day20160102,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160103' THEN r END) AS day20160103,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160104' THEN r END) AS day20160104,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160105' THEN r END) AS day20160105,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160106' THEN r END) AS day20160106,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160107' THEN r END) AS day20160107,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160108' THEN r END) AS day20160108,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160109' THEN r END) AS day20160109,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160110' THEN r END) AS day20160110
FROM (
  SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
  FROM (
    SELECT 
      stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
    FROM `bigquery-public-data.noaa_gsod.gsod2016` AS t 
    GROUP BY stn, d
  ) 
)
GROUP BY line

Run the above query in the Web UI with pivot_table (you can choose whatever name you want here) as the destination.

As you can see, here we get a table with 10 columns – one column per day – and the schema of each column is a copy of the schema of the original table.
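If you prefer to script this rather than use the Web UI, the same step can be run with the bq CLI along these lines (a minimal sketch; pivot_query.sql is assumed to hold the Step 1 query, and mydataset.pivot_table is a placeholder destination):

# Run the Step 1 pivot query from the CLI with pivot_table as the destination
bq query --use_legacy_sql=false \
  --destination_table 'mydataset.pivot_table' \
  "$(cat pivot_query.sql)"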


Step 2 – Create the sharded tables one by one, ONLY scanning the respective column (no full table scan)

#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20160101) AS r

Run the above query from the Web UI with a destination table named mytable_20160101.

You can run the same for the next day:

#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20160102) AS r

Now you should have a destination table named mytable_20160102, and so on.
You should be able to automate/script this step with any client of your choice. Note: those final daily tables will have exactly the same schema as the original table!
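For example, a minimal bash sketch of that automation (dataset/table names are placeholders and the date list is hard-coded for the 10-day exercise):

#!/bin/bash
# Sketch: materialize one sharded table per day, reading only that day's column
# from the pivot table. mydataset / pivot_table / mytable_ are placeholder names.
for day in 20160101 20160102 20160103 20160104 20160105 \
           20160106 20160107 20160108 20160109 20160110; do
  bq query --use_legacy_sql=false \
    --destination_table "mydataset.mytable_${day}" \
    "SELECT r.* FROM mydataset.pivot_table, UNNEST(day${day}) AS r"
done

# If the end goal is one date-partitioned table, the sharded tables can then
# be folded into it with bq partition:
bq partition --time_partitioning_type=DAY mydataset.mytable_ mydataset.mytable_partitioned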

There are many variations of how you can use the above approach – it is up to your creativity.

Note: BigQuery allows up to 10,000 columns per table, so 365 columns for the respective days of one year is definitely not a problem here :o)

Mikhail Berlyant

Answering myself here. Another approach I've seen done is to write a script that:

  1. Parses the tablebackup.json file and outputs many files named tablebackuppartitionYYYYMMDD.json, split on a provided parameter.

  2. Creates a batch script to bq load all the files into the appropriate table partitions.

The script would need to process the file row by row or in chunks to be able to handle massive backups, and it would take some time. The advantage of this method is that it would be generic and usable by a sysadmin who is untrained in BQ.
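A rough sketch of such a script, assuming the backup is newline-delimited JSON and each row carries a date field literally named date (all file, field, and table names here are hypothetical):

#!/bin/bash
# Sketch: split an NDJSON backup into per-day files, then load each file into
# the matching partition. Assumes mydataset.restored_table already exists as a
# day-partitioned table; adjust the jq filter to the real date field/format.
BACKUP=tablebackup.json
OUTDIR=split
mkdir -p "$OUTDIR"

# 1. Split row by row (streams the file, so it copes with large backups, if slowly)
while IFS= read -r row; do
  day=$(printf '%s\n' "$row" | jq -r '.date' | tr -d '-')   # 2016-01-01 -> 20160101
  printf '%s\n' "$row" >> "$OUTDIR/tablebackuppartition${day}.json"
done < "$BACKUP"

# 2. Load each daily file into its partition via the $YYYYMMDD decorator
for f in "$OUTDIR"/tablebackuppartition*.json; do
  day=${f##*tablebackuppartition}; day=${day%.json}
  bq load --source_format=NEWLINE_DELIMITED_JSON \
    "mydataset.restored_table\$${day}" "$f"
done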