
I have a legacy unpartitioned BigQuery table that streams logs from various sources (let's call it BigOldA). The aim is to transfer it to a new day-partitioned table (let's call it PartByDay), which was done with the help of the following link:

https://cloud.google.com/bigquery/docs/creating-column-partitions#creating_a_partitioned_table_from_a_query_result

bq query \
--allow_large_results \
--replace=true \
--destination_table <project>:<data-set>.<PartByDay> \
--time_partitioning_field REQUEST_DATETIME \
--use_legacy_sql=false 'SELECT * FROM `<project>.<data-set>.<BigOldA>`'

I have migrated the historical data to the new table, but I cannot delete it from Table BigOldA because I run into the known limitation that DML statements are not yet supported on tables with an active streaming buffer:

Error: UPDATE or DELETE DML statements are not supported over 
table <project>:<data-set>.BigOldA with streaming buffer

I was planning to run a batch job every day that transfers T-1 data from Table BigOldA to Table PartByDay and periodically deletes it from BigOldA, so that the streaming buffer keeps flowing into BigOldA while PartByDay is used for analytics. Now I am not sure whether that is achievable.
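
To make the plan concrete, the daily T-1 job I had in mind looks roughly like this (same placeholder names as above, assuming REQUEST_DATETIME drives the day boundary; a sketch, not a tested job):

# 1. append yesterday's rows to the partitioned table
bq query --use_legacy_sql=false --append_table \
--destination_table <project>:<data-set>.PartByDay \
'SELECT * FROM `<project>.<data-set>.BigOldA`
 WHERE DATE(REQUEST_DATETIME) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)'

# 2. delete the same rows from the source -- this is the step that fails
#    while BigOldA still has an active streaming buffer
bq query --use_legacy_sql=false \
'DELETE FROM `<project>.<data-set>.BigOldA`
 WHERE DATE(REQUEST_DATETIME) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)'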

I am looking for an alternative solution or best practice for periodically transferring data from a streaming-buffer table to a partitioned table and keeping the two in sync. Also, as the data is streamed from independent production sources, it is not possible to point all of the sources at PartByDay, and the streamingBuffer property returned by tables.get is never null.
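
For reference, this is how I check whether the buffer is still attached; the streamingBuffer section only shows up while there are buffered rows, and in my case it never goes away:

bq show --format=prettyjson <project>:<data-set>.BigOldA
# the output contains a "streamingBuffer" object (estimatedRows, estimatedBytes,
# oldestEntryTime) for as long as rows remain in the buffer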

Logan
  • Maybe I don't quite understand your question, or I've missed something, but why can't you just delete the original table and then rename the migrated table to the original name after you've run your history job? (This assumes your streaming component to BigQuery is fault tolerant.) Essentially, I don't quite get this: _"I have migrated the historical data to the new table but I cannot delete them in Table BigOldA"_ – Graham Polley Apr 10 '18 at 13:06
  • Not sure if the streaming component is fully fault-tolerant, and I'm definitely not prepared to lose data during the delete & rename. Also, will the apps & microservices that stream data to BigQuery treat the day-partitioned table as the same table and start persisting data into the new partitions with no code change? – Logan Apr 10 '18 at 13:25
  • If it's designed well, you shouldn't lose any data. Whatever is streaming to BigQuery should be able to store events until the table comes back online. It shouldn't change anything for your components that are streaming once the table is partitioned. They just see the same table, but under the hood it's now partitioned. That said, I haven't tested that, so you should ;) – Graham Polley Apr 10 '18 at 13:32
  • @GrahamPolley can you please add your comment as an answer? – Lefteris S Apr 27 '18 at 12:22
  • @GrahamPolley thanks man.. the one you suggested works. Just gotta check for duplicates & remove them – Logan May 02 '18 at 12:00
  • @GrahamPolley: How do I query only the data in the streaming buffer with this type of partitioning? – Logan Jun 05 '18 at 04:21
  • @Logan maybe create a new question for that? – Graham Polley Jun 07 '18 at 08:09

2 Answers


You could just delete the original table and then rename the migrated table to the original name after you've run your history job. This assumes your streaming component to BigQuery is fault tolerant. If it's designed well, you shouldn't lose any data: whatever is streaming to BigQuery should be able to store events until the table comes back online. Nothing should change for the components that are streaming once the table is partitioned; they just see the same table, but under the hood it's now partitioned.
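
In bq terms the swap is roughly the following (an untested sketch with the placeholder names from the question; I'd expect the copy to keep the partitioning spec, but as per the comments, test it yourself):

bq rm -f -t <project>:<data-set>.BigOldA
bq cp <project>:<data-set>.PartByDay <project>:<data-set>.BigOldA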

Graham Polley

If anyone is interested in the script, here it is.

#!/bin/sh
# This script:
# 1. copies the data into a new partitioned table
# 2. deletes the original unpartitioned table
# 3. copies the partitioned table back to the original table name
# TODO 4. delete the intermediate copy

set -e
source_project="<source-project>"
source_dataset="<source-dataset>"
source_table="<source-table-to-partition>"

destination_project="<destination-project>"
destination_dataset="<destination-dataset>"
partition_field="<timestamp-partition-field>"
destination_table="<table-copy-partition>"

source_path="$source_project.$source_dataset.$source_table"
source_l_path="$source_project:$source_dataset.$source_table"
destination_path="$destination_project:$destination_dataset.$destination_table"

echo "copying table from $source_path to $destination_path"
query=$(cat <<-END
SELECT * FROM \`$source_path\`
END
)

echo "deleting old table"
bq rm -f -t $destination_path
echo "running the query: $query"

bq query --quiet=true --use_legacy_sql=false --apilog=stderr --allow_large_results --replace=true --destination_table $destination_path --time_partitioning_field $partition_field "$query"

echo "removing the original table: $source_path"
bq rm -f -t $source_l_path
echo "table deleted"
echo "copying the partition table to the original source path"
bq cp -f -n $destination_path $source_l_path
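
Per the comments above, rows streamed during the cutover window can end up duplicated after the copy back. A rough duplicate check I run afterwards; assuming there is no natural key to dedupe on, it compares full-row fingerprints (this also assumes no ARRAY/STRUCT columns -- adjust to your schema):

bq query --use_legacy_sql=false \
"SELECT TO_JSON_STRING(t) AS fp, COUNT(*) AS copies
 FROM \`$source_path\` AS t
 GROUP BY fp
 HAVING COUNT(*) > 1"

If duplicates show up, one option is a SELECT DISTINCT * rewrite of the table using the same bq query --replace pattern as above, but check the streaming buffer state before truncating.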
Logan