28

In June the BQ team announced support for date-partitioned tables. But the guide does not explain how to migrate existing non-partitioned tables into the new style.

I am looking for a way to update several, if not all, of my tables to the new style.

Also, besides DAY-type partitioning, what other options are available? And does the BQ UI expose this? I wasn't able to create such a partitioned table from the BQ Web UI.

Pentium10
  • 204,586
  • 122
  • 423
  • 502
  • 2
    for anyone using R, here's a gist that automatically generates the code required for the accepted answer: https://gist.github.com/anonymous/8e6541433b345c798df20b638ae98b3a#file-gistfile1-txt hope that saves someone some time – rmg Dec 08 '17 at 13:07
  • The above gist was also developed further in a blog post at RStudio: https://rviews.rstudio.com/2018/02/02/cost-effective-bigquery-with-r/ – rmg Feb 05 '18 at 10:44
  • https://stackoverflow.com/questions/49754400/migrating-non-partitioned-streaming-table-to-partitioned-table-bigquery – Logan Aug 01 '18 at 08:11
  • 1
    See the following question to see how to get around this limit by partitioning by week/month/year: https://stackoverflow.com/a/56125049/132438. You'll also be able to partition with only one scan. – Felipe Hoffa May 14 '19 at 07:29

6 Answers

31

What works for me is the following set of queries, applied directly in BigQuery (BigQuery > create new query).

CREATE TABLE (new?)dataset.new_table
PARTITION BY DATE(date_column)
AS SELECT * FROM dataset.table_to_copy;

Then, as the next step, I drop the old table:

DROP TABLE dataset.table_to_copy;

I got this solution from https://fivetran.com/docs/warehouses/bigquery/partition-table using only step 2
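
If the column you want to partition on is already a DATE (rather than a TIMESTAMP), you can partition by it directly. Here is a minimal variant of the same statement, as a sketch only; the require_partition_filter option is my own illustration and not part of the original answer:

CREATE TABLE `dataset.new_table`
PARTITION BY date_column                      -- date_column assumed to already be a DATE
OPTIONS (require_partition_filter = TRUE)     -- optional: reject queries that don't filter on the partition
AS
SELECT * FROM `dataset.table_to_copy`;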

GuenPa
  • 311
  • 3
  • 2
17

From Pavan's answer: "Please note that this approach will charge you the scan cost of the source table for the query as many times as you query it."


From Pentium10's comments: "So suppose I have several years of data, I need to prepare a different query for each day and run all of it, and suppose I have 1000 days in history, I need to pay 1000 times the full query price from the source table?"

As we can see, the main problem here is having a full scan for each and every day. The rest is less of a problem and can easily be scripted in any client of your choice.

So, below is how to partition a table while avoiding a full table scan for each and every day.

The step-by-step below shows the approach.

It is generic enough to extend/apply to any real use case; meanwhile, I am using bigquery-public-data.noaa_gsod.gsod2017 and limiting the "exercise" to just 10 days to keep it readable.

Step 1 – Create the pivot table
In this step we
a) compress each row's content into a record/array
and
b) put them all into the respective "daily" column

#standardSQL
SELECT
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170101' THEN r END) AS day20170101,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170102' THEN r END) AS day20170102,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170103' THEN r END) AS day20170103,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170104' THEN r END) AS day20170104,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170105' THEN r END) AS day20170105,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170106' THEN r END) AS day20170106,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170107' THEN r END) AS day20170107,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170108' THEN r END) AS day20170108,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170109' THEN r END) AS day20170109,
  ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170110' THEN r END) AS day20170110
FROM (
  SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
  FROM (
    SELECT 
      stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
    FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t 
    GROUP BY stn, d
  ) 
)
GROUP BY line  

Run the above query in the Web UI with pivot_table (or whatever name you prefer) as the destination.

As we can see, we get a table with 10 columns here – one column per day, and the schema of each column is a copy of the schema of the original table:

[screenshot: schema of the resulting pivot_table]
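
Writing 365 such ARRAY_CONCAT_AGG lines by hand is tedious, so the SELECT list can be generated. Below is a small helper query (my sketch, not part of the original answer) that produces the column expressions from the source table itself; it only reads the year/mo/da columns. Paste the resulting select_list into the pivot query above.

#standardSQL
SELECT STRING_AGG(
  FORMAT("ARRAY_CONCAT_AGG(CASE WHEN d = '%s' THEN r END) AS %s", d, d),
  ',\n' ORDER BY d
) AS select_list
FROM (
  SELECT DISTINCT CONCAT('day', year, mo, da) AS d
  FROM `bigquery-public-data.noaa_gsod.gsod2017`
  WHERE CONCAT(year, mo, da) BETWEEN '20170101' AND '20170110'  -- same 10-day limit as the exercise
)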

Step 2 – Process partitions one by one, ONLY scanning the respective column (no full table scan), inserting into the respective partition

#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170101) AS r

Run the above query from the Web UI with the destination table named mytable$20170101.

You can run the same for the next day:

#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170102) AS r

Now you should have the destination table mytable$20170102, and so on.

[screenshot: the partitioned destination table]

You should be able to automate/script this step with any client of your choice
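
For example, if the destination is created as a column-partitioned table (partitioned by a DATE column) rather than loaded through the mytable$YYYYMMDD decorators, the per-day loads can even be scripted inside BigQuery itself. Below is a rough sketch using BigQuery scripting (a feature that did not exist when this answer was written); the dataset and table names are placeholders, and mydataset.mytable is assumed to have the same schema and column order as the original rows:

DECLARE days ARRAY<STRING> DEFAULT ['day20170101', 'day20170102', 'day20170103'];  -- extend to all day columns
DECLARE i INT64 DEFAULT 0;

WHILE i < ARRAY_LENGTH(days) DO
  -- each iteration scans only one column of the pivot table
  EXECUTE IMMEDIATE FORMAT("""
    INSERT INTO `mydataset.mytable`
    SELECT r.*
    FROM `mydataset.pivot_table`, UNNEST(%s) AS r
  """, days[OFFSET(i)]);
  SET i = i + 1;
END WHILE;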

There are many variations of how you can use the above approach – it is up to your creativity.

Note: BigQuery allows up to 10,000 columns per table, so 365 columns for the respective days of one year is definitely not a problem here :o) Unless there is a limitation on how far back you can go with new partitions – I heard (but haven't had a chance to check yet) that it is now no more than 90 days back.

Update

Please note: the above version has a little extra logic for packing all aggregated cells into as few final rows as possible.

ROW_NUMBER() OVER(PARTITION BY d) AS line
and then
GROUP BY line
along with
ARRAY_CONCAT_AGG(…)
does this

This works well when the row size in your original table is not that big, so the final combined row size will still be within the row size limit that BigQuery has (which I believe is 10 MB as of now).

If your source table already has a row size close to that limit, use the adjusted version below.

In this version, grouping is removed so that each row has a value for only one column.

#standardSQL
SELECT
    CASE WHEN d = 'day20170101' THEN r END AS day20170101,
    CASE WHEN d = 'day20170102' THEN r END AS day20170102,
    CASE WHEN d = 'day20170103' THEN r END AS day20170103,
    CASE WHEN d = 'day20170104' THEN r END AS day20170104,
    CASE WHEN d = 'day20170105' THEN r END AS day20170105,
    CASE WHEN d = 'day20170106' THEN r END AS day20170106,
    CASE WHEN d = 'day20170107' THEN r END AS day20170107,
    CASE WHEN d = 'day20170108' THEN r END AS day20170108,
    CASE WHEN d = 'day20170109' THEN r END AS day20170109,
    CASE WHEN d = 'day20170110' THEN r END AS day20170110
FROM (
    SELECT 
        stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
    FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t 
    GROUP BY stn, d
)
WHERE d BETWEEN 'day20170101' AND 'day20170110'

As you can see now, the pivot table (sparse_pivot_table) is sparse enough (the same 21.5 MB, but now 114,089 rows vs. 11,584 rows in pivot_table), so it has an average row size of 190 B vs. 1.9 KB in the initial version – which is obviously about 10 times less, in line with the number of columns in the example.
So before using this approach, some math needs to be done to project/estimate what can be done and how!

[screenshot: details of the sparse pivot table]

Still, each cell in the pivot table is a sort of JSON representation of a whole row in the original table: it holds not just the values, as the rows in the original table do, but also the schema.

[screenshot: contents of a single pivot-table cell]

As such, it is quite verbose – thus the size of a cell can be multiple times bigger than the original size [which limits the usage of this approach ... unless you get even more creative :o) ... and there are still plenty of areas here to apply it :o)]

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Wow. Going to test this now. What tool/UI is that Mikhail? – Graham Polley Mar 31 '17 at 02:19
  • @GrahamPolley – I haven't used the BigQuery Web UI for quite a long time already, which btw doesn't help me in maintaining BQ Mate well, so for that I really depend on users' feedback - I'm still trying to keep it updated and make fixes when I know they're needed and have some spare time. – Mikhail Berlyant Mar 31 '17 at 05:04
  • @GrahamPolley – As for `what tool/UI is this?` - it is one of a few tools I had to create (with my team, of course) based on business users' demand; it compensates for many features lacking in Google's Web UI. We got to the point where the Web UI just doesn't perform as needed – and even mostly crashes at our volume and scale. We have had and used these tools for a few years already – some of the features/improvements are slowly being added by the BQ team, but they are still far from what we need (and have in our internal tools) – Mikhail Berlyant Mar 31 '17 at 05:05
  • 1
    This is probably the most ingenious solution (for BigQuery anyway) that I've seen on Stackoverflow Mikhail. Much respect! (upvoted) – Graham Polley Mar 31 '17 at 08:26
  • Doesn't BigQuery have a ~10 or ~20MB limit on row size that defeats this method on anything but small datasets (for which table scans are cheap anyways)? – Charles Offenbacher Apr 03 '17 at 19:36
  • @CharlesOffenbacher - thank you for checking my answer and providing feedback. I think it is hard to provide universal solution for any problem and specifically for one in this post. See update in my answer. As of `anything but small datasets (for which table scans are cheap anyways)` - it is not correct. Dependency is not on table size but row size. So many use cases are with moderate row size but billions of rows – cost is huge – and above approach will save tons of $$$ :o ) – Mikhail Berlyant Apr 03 '17 at 22:05
  • @MikhailBerlyant nice update! That looks like it would work. But about the cost: BigQuery is billed per byte read, not per row, right? So if we consider 365 rows of 1 MB and 365k rows of 1 KB, the $ cost of a table scan if you're selecting all data is the same, isn't it? – Charles Offenbacher Apr 04 '17 at 02:17
  • @CharlesOffenbacher the query costs are reduced quite a bit: the pivot table step queries all the data in the table once, and the daily table creation queries each record in the pivot once. Creating daily partitioned tables with `SELECT * FROM .... WHERE date = partition_date`, will bill you for querying the whole source table each time you create a partition. For many Tb tables that span multiple years, the difference could be savings of 10k or more. – rmg Dec 08 '17 at 12:56
  • @rmg doesn't it require the same `SELECT * FROM .... WHERE date = partition_date` for the query using the pivot table method? It will only return one row, but according to BigQuery column-store pricing: "You're charged according to the total data processed in the columns you select". The pivot table method doesn't seem to actually reduce the amount of data in the columns you select, am I missing something? – Charles Offenbacher Dec 09 '17 at 17:24
  • @CharlesOffenbacher The pivot table method doesn't use a `WHERE date=partition`. It does `SELECT *` once to create the pivot table. To create each partitioned table, you select the column representing each date only once from the pivot table, billing you for selecting only one column each time. You do this for the number of partition dates, meaning you get billed for selecting all the data in the table twice: once during the pivot table step, and once for the aggregate total after selecting all partition date columns. – rmg Dec 11 '17 at 10:22
  • Ah, I see, each date is a column, not a row. Sorry, it's been a long time, I forgot what the original post looked like. – Charles Offenbacher Dec 11 '17 at 13:41
  • Hi Mikhail! Since this question still has a large # of views and activity - can you update the answer so people can know that now it will only take 1 scan? – Felipe Hoffa May 14 '19 at 07:14
11

As of today, you can now create a partitioned table from a non-partitioned table by querying it and specifying the partition column. You'll pay for one full table scan on the original (non-partitioned) table. Note: this is currently in beta.

https://cloud.google.com/bigquery/docs/creating-column-partitions#creating_a_partitioned_table_from_a_query_result

To create a partitioned table from a query result, write the results to a new destination table. You can create a partitioned table by querying either a partitioned table or a non-partitioned table. You cannot change an existing standard table to a partitioned table using query results.

Graham Polley
  • 14,393
  • 4
  • 44
  • 80
  • Can we achieve this using a Java client as well? I am using client version 1.34.0 and there doesn't seem to be an option to select a partitioning option. Is this only supported in the REST API? – user_1357 Jul 14 '18 at 18:20
  • when I say REST API, I mean old API. Thanks – user_1357 Jul 15 '18 at 20:09
7

Until the new feature is rolled out in BigQuery, there is another (much cheaper) way to partition the table(s) by using Cloud Dataflow. We used this approach instead of running hundreds of SELECT * statements, which would have cost us thousands of dollars.

  1. Create the partitioned table in BigQuery using the normal partition command
  2. Create a Dataflow pipeline and use a BigQueryIO.Read source to read the table
  3. Use a Partition transform to partition each row
  4. Using a max of 200 shards/sinks at a time (any more than that and you hit API limits), create a BigQueryIO.Write sink for each day/shard that will write to the corresponding partition using the partition decorator syntax - "$YYYYMMDD"
  5. Repeat N times until all data is processed.

Here's an example on Github to get you started.

You still have to pay for the Dataflow pipeline(s), but it's a fraction of the cost of using multiple SELECT * in BigQuery.

Graham Polley
  • 14,393
  • 4
  • 44
  • 80
5

If you have date-sharded tables today, you can use this approach:

https://cloud.google.com/bigquery/docs/creating-partitioned-tables#converting_dated_tables_into_a_partitioned_table

If you have a single non-partitioned table to be converted to partitioned table, you can try the approach of running a SELECT * query with allow large results and using the table's partition as the destination (similar to how you'd restate data for a partition):

https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition
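
Concretely, the per-partition restatement looks something like the query below (a sketch, not from the documentation; the table and column names are placeholders). You run one such query per day with "allow large results" enabled and set the destination table to the matching partition decorator, e.g. mydataset.mytable$20160701, via the UI, CLI, or API; the decorator is not part of the SQL itself:

#standardSQL
SELECT *
FROM `mydataset.table_to_copy`
WHERE DATE(event_ts) = '2016-07-01'  -- event_ts assumed to be the TIMESTAMP column you partition on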

Please note that this approach will charge you the scan cost of the source table for the query as many times as you query it.

We are working on something to make this scenario significantly better in the next few months.

Pavan Edara
  • 2,285
  • 1
  • 12
  • 12
  • How do I ensure with the 2nd approach that the partitioning is of type DAY? And what other types do we have other than DAY? – Pentium10 Aug 17 '16 at 16:21
  • We only support a DAY granularity at this point since this has historically been the most common usage pattern. I may have misunderstood your question, but you can create the table upfront with the schema, for example: bq mk --schema="field1:integer,field2:string" --time_partitioning_type=DAY mydataset.table1 – Pavan Edara Aug 17 '16 at 16:24
  • OK, good. So I need to create a new table with partitioning and then SELECT *, and this will retain all the repeated and nested types we have, right? – Pentium10 Aug 17 '16 at 16:27
  • It should indeed retain the repeated and nested types. Please let us know if you run into any problems with this. Thanks! – Pavan Edara Aug 17 '16 at 16:59
  • 2
    I have a timestamp field, which I would like to be used to partitioned. How can I tell to the system that it should use that field? – Pentium10 Aug 17 '16 at 19:45
  • This functionality is not available yet -- but please stay tuned for updates on this. For now, you can achieve this functionality as described above by selecting using a filter on that column and outputting to the partition as destination (eg. table$20160701). – Pavan Edara Aug 17 '16 at 20:12
  • 2
    So suppose I have several years of data, I need to prepare different query for each day and run all of it, and suppose I have 1000 days in history, I need to pay 1000 times the full query price from the source table? If this is true, it's "unbelievable". – Pentium10 Aug 18 '16 at 06:23
  • 4
    We are actively working on making a single query be able to do this, but for the time being it is possible to reduce cost using an year, month, day split strategy. First partition into multiple yearly tables, then each one into monthly tables and then each one into daily tables. We understand that this is expensive and cumbersome for the user, so we hope to have an easier, cheaper solution soon. Thank you. – Pavan Edara Aug 18 '16 at 06:34
  • 1
    @PavanEdara any updates on this? Going to be a big, ugly, expensive conversion for my team otherwise :( – Charles Offenbacher Jan 04 '17 at 03:55
  • We had to delay the release of this due to unforeseen change in plans. We will provide a more useful update as soon as we are able. We sincerely apologize for the delay. – Pavan Edara Mar 13 '17 at 03:06
-1
-- create a partitioned copy of the old table (one full scan of the old table)
CREATE TABLE `dataset.new_table`
PARTITION BY DATE(date_column)
AS SELECT * FROM `dataset.old_table`;

-- remove the original non-partitioned table
DROP TABLE `dataset.old_table`;

-- give the partitioned table the original name
ALTER TABLE `dataset.new_table`
RENAME TO old_table;
Kamlesh
  • 64
  • 1