
I have a BigQuery table that's clustered by several columns; let's call them `client_id` and `attribute_id`.

What I'd like is to submit one job or command that exports that table's data to Cloud Storage, but saves each cluster (i.e. each combination of `client_id` and `attribute_id`) to its own object. So the final URIs might look something like this:

gs://my_bucket/{client_id}/{attribute_id}/object.avro

I know I could pull this off by iterating over all the possible combinations of `client_id` and `attribute_id`, using a client library to query each combination's data into a BigQuery temp table, and then exporting that data to a correctly named object, and I could do all of that asynchronously.
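
For illustration, here's a rough sketch of that manual approach, assuming the Python `google-cloud-bigquery` client library. The project, dataset, and temp-table names are placeholders, the two columns are assumed to be strings, and a real version would run the jobs asynchronously rather than waiting on each one:

```python
# Rough sketch of the per-combination export I'd like to avoid writing.
# Project/dataset/table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
source = "my_project.my_dataset.my_table"

# Find every (client_id, attribute_id) combination present in the table.
combos = client.query(
    f"SELECT DISTINCT client_id, attribute_id FROM `{source}`"
).result()

for combo in combos:
    # Query one cluster's rows into an explicitly named temp table...
    # (assumes the column values are safe to embed in a table name)
    dest = f"my_project.my_dataset.tmp_{combo.client_id}_{combo.attribute_id}"
    query_job = client.query(
        f"SELECT * FROM `{source}` "
        "WHERE client_id = @client_id AND attribute_id = @attribute_id",
        job_config=bigquery.QueryJobConfig(
            destination=dest,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            query_parameters=[
                bigquery.ScalarQueryParameter("client_id", "STRING", combo.client_id),
                bigquery.ScalarQueryParameter("attribute_id", "STRING", combo.attribute_id),
            ],
        ),
    )
    query_job.result()

    # ...then export that table to its own correctly named object.
    extract_job = client.extract_table(
        query_job.destination,
        f"gs://my_bucket/{combo.client_id}/{combo.attribute_id}/object.avro",
        job_config=bigquery.ExtractJobConfig(
            destination_format=bigquery.DestinationFormat.AVRO
        ),
    )
    extract_job.result()
```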

But.... I imagine all the clustered data is already stored in a format somewhat like what I'm describing, and I'd love to avoid the unnecessary cost and headache of writing the script to do it myself.

Is there already a way to accomplish this, without requesting a new feature?

Thanks!

blaineh
  • feature requests can be submitted at https://issuetracker.google.com/issues/new?component=187149&template=1162659 – Mikhail Berlyant Mar 29 '19 at 22:08
  • I've updated my question slightly to specify that I'm looking for a way to accomplish this without waiting for a feature request :) I'll certainly do that though if it turns out there's no existing way. – blaineh Mar 29 '19 at 22:20
  • Is it really a good idea to store millions of objects on GCS? It feels to me that you need point access on your BQ table and are looking for a faster alternative. Maybe Cloud SQL or Cloud Datastore would work for you in this case. – Yun Zhang Mar 30 '19 at 00:32
  • We're using bigquery to perform calculations on large datasets, but the results themselves will be reasonably sized. The number of objects representing the results will only get into the thousands. – blaineh Mar 30 '19 at 00:47
  • Why do you want to do this, i.e. what's the use case? – Graham Polley Mar 30 '19 at 06:59
  • Essentially we want results of our calculations to be available to a client program, and they're really only useful in these "partition chunks". I indicated I can and will write a script to do this manually if necessary, but it would be much cheaper and more efficient if we could just export the table like I'm describing. – blaineh Apr 01 '19 at 16:17
  • Currently we're doing this by exporting the whole table to a relational database. Each "chunk" is several hundred rows, so for a relational database that's actually a pretty painful amount of data (thousands of chunks, hundreds of rows each). But these chunks are also completely static once they've been computed, so we think they're a perfect candidate for object storage (each chunk of rows is independently useful, and they're very static). – blaineh Apr 01 '19 at 16:41
  • And I'm obfuscating the real exact nature of the data on purpose (it's for a proprietary tool), so I can see how my throwaway example of `client_id` and `attribute_id` doesn't seem very compelling. Just trust me hahaha – blaineh Apr 01 '19 at 16:44
  • The [beta BigQuery Storage API](https://googleapis.github.io/google-cloud-python/latest/bigquery_storage/index.html) might give me basically what I'm looking for. Once I get a chance to test it we'll see! (A rough sketch of reading one cluster with it is included after these comments.) – blaineh Apr 02 '19 at 23:26
  • The BigQuery Storage API uses RPC to provide an efficient way to access the data. It doesn't seem to be what you are looking for. I think that you have to either implement the script or wait for the feature request. – Tlaquetzal Aug 22 '19 at 13:44
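
For reference, a minimal sketch of what reading a single cluster through the BigQuery Storage API looks like, assuming the Python `google-cloud-bigquery-storage` library; the project, dataset, table, and filter values are placeholders. It streams the filtered rows back over RPC, but writing them out to the desired Cloud Storage objects is still left to the caller:

```python
# Read one (client_id, attribute_id) cluster via the BigQuery Storage API
# using a row restriction. Names and filter values are placeholders.
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()

session = client.create_read_session(
    parent="projects/my_project",
    read_session=types.ReadSession(
        table="projects/my_project/datasets/my_dataset/tables/my_table",
        data_format=types.DataFormat.AVRO,
        read_options=types.ReadSession.TableReadOptions(
            row_restriction="client_id = 'c1' AND attribute_id = 'a1'",
        ),
    ),
    max_stream_count=1,
)

# Stream back only the rows matching the restriction; exporting them to
# gs://my_bucket/{client_id}/{attribute_id}/... would still be manual work.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)
```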

0 Answers