
I need to move BigQuery datasets with many tables (both partitioned and unpartitioned) from the US to the EU. If the source table is unpartitioned, the documented approach of extracting the data to GCS with `bq extract` and loading it into a dataset in the other region with `bq load` works fine, so far so good.
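(For reference, that unpartitioned round trip could look roughly like the following with the google-cloud-bigquery Python client instead of the bq CLI; the project, bucket and dataset names are placeholders, and Avro as the export format is just an assumption.)

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Export the US table to a bucket located in the US.
extract_job = client.extract_table(
    "my-project.us_dataset.mytable",
    "gs://my-us-bucket/export/mytable-*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
    location="US",
)
extract_job.result()

# After copying the files to an EU bucket (e.g. with gsutil), load them
# into the dataset that lives in the EU.
load_job = client.load_table_from_uri(
    "gs://my-eu-bucket/export/mytable-*.avro",
    "my-project.eu_dataset.mytable",
    job_config=bigquery.LoadJobConfig(source_format="AVRO"),
    location="EU",
)
load_job.result()
```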

If, however, the source table is partitioned, the mapping between data and partition is lost during the load step and I end up with all the data in a single partition.

Is there a good (automated) way of exporting and importing partitioned tables in BQ? Any pointers would be greatly appreciated!

  • What type of partitioning are you using? I'm assuming ingestion-based partitioning and not column-based (otherwise it wouldn't be such a problem :)). – Graham Polley May 09 '18 at 11:05
  • The tables are mostly partitioned by day, so in the table you can find a column `_PARTITIONTIME`. But this column is a pseudo-column and not part of the table's schema, therefore it is excluded during the extract. – hschoeneberg May 09 '18 at 11:26
  • Right, so ingestion time based partitioning and not column based. Is there a column (timestamp/date) in your table that would suffice as a partition instead of `_PARTITIONTIME`? – Graham Polley May 09 '18 at 11:42
  • On each table there is some sort of timestamp from which the target partition was inferred during ingestion. But the columns vary from table to table. – hschoeneberg May 09 '18 at 11:59
  • Perfect. Then why not just use the corresponding column in each table and load them as column based partitioned tables? You'll have to change the SQL that references them i.e. change `_PARTITIONTIME` to the partitioned column, but it's the cleanest and best solution IMHO. – Graham Polley May 09 '18 at 12:08
  • While I agree that this would definitely be a cleaner solution in the future, I have to make sure to migrate our datasets to the EU as they are right now to not impact the tables' consumers. – hschoeneberg May 09 '18 at 12:28
  • You can still migrate them. 1. Extract to GCS. 2. Load into EU dataset as a column based partitioned table. Is the problem that you can't update the SQL? Do you use views? – Graham Polley May 09 '18 at 13:08
  • There are some views included, but we cannot have the tables' clients update their SQL at this time. But I will check out your suggestion, you are referring to [this](https://cloud.google.com/bigquery/docs/partitioned-tables) right? – hschoeneberg May 09 '18 at 13:24
  • Correct, the 2nd (newer) one. You can now partition on a column. Ok, so if you can't migrate to column based partitioned tables, then I'd suggest using Cloud Dataflow to do this. – Graham Polley May 09 '18 at 13:29
  • Yeah, I will check out what I can do in Dataflow. Thx! – hschoeneberg May 09 '18 at 14:05
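For completeness, a rough sketch of the column-based variant discussed in the comments above, again with the Python client; the partition column name (`event_ts`) is hypothetical and, as noted, varies from table to table:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Load the exported files into an EU table partitioned on a real timestamp
# column rather than the _PARTITIONTIME pseudo-column.
job_config = bigquery.LoadJobConfig(
    source_format="AVRO",
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="event_ts",  # hypothetical column; differs per table
    ),
)
client.load_table_from_uri(
    "gs://my-eu-bucket/export/mytable-*.avro",
    "my-project.eu_dataset.mytable",
    job_config=job_config,
    location="EU",
).result()
```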

2 Answers


There are a few ways to do this, but I would personally use Cloud Dataflow to solve it. You'll have to pay a little bit more for Dataflow, but you'll save a lot of time and scripting in the long run.

High level:

  1. Spin up a Dataflow pipeline
  2. Read the partitioned table in the US (possibly aliasing `_PARTITIONTIME` to make it easier later)
  3. Write the results back to BigQuery in the EU, into the same partitions.

It's basically the same as what was talked about here.
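One way such a pipeline could look with the Beam Python SDK is sketched below; it side-steps the aliasing by reading one day partition at a time and writing it to the matching partition decorator in the EU. The project, bucket, table names and the hard-coded partition list are assumptions, not a drop-in implementation.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical list of day partitions to migrate; in practice you would
# list them first (e.g. with bq ls or a query over the partition summary).
PARTITIONS = ["20180507", "20180508", "20180509"]

options = PipelineOptions(
    project="my-project",
    temp_location="gs://my-dataflow-bucket/tmp",
    runner="DataflowRunner",
)

with beam.Pipeline(options=options) as p:
    for day in PARTITIONS:
        date = "{}-{}-{}".format(day[:4], day[4:6], day[6:])
        (
            p
            # Read a single day partition of the US table.
            | "read_" + day >> beam.io.ReadFromBigQuery(
                query="SELECT * FROM `my-project.us_dataset.mytable` "
                      "WHERE _PARTITIONTIME = TIMESTAMP('{}')".format(date),
                use_standard_sql=True,
            )
            # Write it into the matching partition of the (pre-created) EU
            # table via a partition decorator, i.e. mytable$YYYYMMDD.
            | "write_" + day >> beam.io.WriteToBigQuery(
                "my-project:eu_dataset.mytable$" + day,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )
```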

Graham Polley

Another solution is to use DML to load the data instead of a load job: https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables. Since you have a timestamp column in each table from which to infer the partition, you can use:

INSERT INTO PROJECT_ID.DATASET.mytable (_PARTITIONTIME, field1, field2) SELECT TIMESTAMP_TRUNC(timestamp_column, DAY), 1, 'one' FROM PROJECT_ID.DATASET.federated_table

You can define a permanent federated table or a temporary one: https://cloud.google.com/bigquery/external-data-cloud-storage#permanent-tables. You'll need to pay for the DML though, while a load job is free.
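A permanent federated table like the one referenced in the query above could be defined roughly like this with the Python client; the bucket path and Avro format are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Point a permanent external (federated) table at the files exported to GCS;
# the INSERT ... SELECT above then reads from it.
external_config = bigquery.ExternalConfig("AVRO")
external_config.source_uris = ["gs://my-eu-bucket/export/mytable-*.avro"]

table = bigquery.Table("my-project.eu_dataset.federated_table")
table.external_data_configuration = external_config
client.create_table(table)
```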

Hua Zhang