
I have a BigQuery table that's clustered by several columns; let's call them `client_id` and `attribute_id`.

What I'd like is to submit one job or command that exports that table's data to Cloud Storage, but saves each cluster (i.e. each combination of `client_id` and `attribute_id`) to its own object. So the final URIs might look something like this:

gs://my_bucket/{client_id}/{attribute_id}/object.avro

I know I could pull this off by iterating over all the possible combinations of `client_id` and `attribute_id`, using a client library to query each combination's data into a BigQuery temp table, and then exporting that data to a correctly named object, and I could do all of that asynchronously.
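
For illustration, here's a rough sketch of that manual approach, assuming the Python `google-cloud-bigquery` client library. The project, dataset, and temp-table names are placeholders, the two columns are assumed to be strings, and a real version would run the jobs asynchronously rather than waiting on each one:

```python
# Rough sketch of the per-combination export I'd like to avoid writing.
# Project/dataset/table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
source = "my_project.my_dataset.my_table"

# Find every (client_id, attribute_id) combination present in the table.
combos = client.query(
    f"SELECT DISTINCT client_id, attribute_id FROM `{source}`"
).result()

for combo in combos:
    # Query one cluster's rows into an explicitly named temp table...
    # (assumes the column values are safe to embed in a table name)
    dest = f"my_project.my_dataset.tmp_{combo.client_id}_{combo.attribute_id}"
    query_job = client.query(
        f"SELECT * FROM `{source}` "
        "WHERE client_id = @client_id AND attribute_id = @attribute_id",
        job_config=bigquery.QueryJobConfig(
            destination=dest,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            query_parameters=[
                bigquery.ScalarQueryParameter("client_id", "STRING", combo.client_id),
                bigquery.ScalarQueryParameter("attribute_id", "STRING", combo.attribute_id),
            ],
        ),
    )
    query_job.result()

    # ...then export that table to its own correctly named object.
    extract_job = client.extract_table(
        query_job.destination,
        f"gs://my_bucket/{combo.client_id}/{combo.attribute_id}/object.avro",
        job_config=bigquery.ExtractJobConfig(
            destination_format=bigquery.DestinationFormat.AVRO
        ),
    )
    extract_job.result()
```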

But.... I imagine all the clustered data is already stored in a format somewhat like what I'm describing, and I'd love to avoid the unnecessary cost and headache of writing the script to do it myself.

Is there already a way to accomplish this, without requesting a new feature?

Thanks!

blaineh
  • feature requests can be submitted at https://issuetracker.google.com/issues/new?component=187149&template=1162659 – Mikhail Berlyant Mar 29 '19 at 22:08
  • I've updated my question slightly to specify that I'm looking for a way to accomplish this without waiting for a feature request :) I'll certainly do that though if it turns out there's no existing way. – blaineh Mar 29 '19 at 22:20
  • Is it really a good idea to store millions of objects on GCS? It feels to me that you need point access on your BQ table and are looking for a faster alternative. Maybe Cloud SQL or Cloud Datastore would work for you in this case. – Yun Zhang Mar 30 '19 at 00:32
  • We're using bigquery to perform calculations on large datasets, but the results themselves will be reasonably sized. The number of objects representing the results will only get into the thousands. – blaineh Mar 30 '19 at 00:47
  • Why do you want to do this, i.e. what's the use case? – Graham Polley Mar 30 '19 at 06:59
  • Essentially we want results of our calculations to be available to a client program, and they're really only useful in these "partition chunks". I indicated I can and will write a script to do this manually if necessary, but it would be much cheaper and more efficient if we could just export the table like I'm describing. – blaineh Apr 01 '19 at 16:17
  • Currently we're doing this by exporting the whole table to a relational database. Each "chunk" is several hundred rows, so for a relational database that's actually a pretty painful amount of data (thousands of chunks, hundreds of rows each). But these chunks are also completely static once they've been computed, so we think they're a perfect candidate for object storage (each chunk of rows is independently useful, and they're very static). – blaineh Apr 01 '19 at 16:41
  • And I'm obfuscating the real exact nature of the data on purpose (it's for a proprietary tool), so I can see how my throwaway example of `client_id` and `attribute_id` doesn't seem very compelling. Just trust me hahaha – blaineh Apr 01 '19 at 16:44
  • The [beta BigQuery Storage API](https://googleapis.github.io/google-cloud-python/latest/bigquery_storage/index.html) might give me basically what I'm looking for. Once I get a chance to test it we'll see! (A rough sketch of reading one cluster with it is included after these comments.) – blaineh Apr 02 '19 at 23:26
  • The BigQuery Storage API uses RPC to provide an efficient way to access the data. It doesn't seem to be what you are looking for. I think that you have to either implement the script or wait for the feature request. – Tlaquetzal Aug 22 '19 at 13:44
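
For reference, a minimal sketch of what reading a single cluster through the BigQuery Storage API looks like, assuming the Python `google-cloud-bigquery-storage` library; the project, dataset, table, and filter values are placeholders. It streams the filtered rows back over RPC, but writing them out to the desired Cloud Storage objects is still left to the caller:

```python
# Read one (client_id, attribute_id) cluster via the BigQuery Storage API
# using a row restriction. Names and filter values are placeholders.
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()

session = client.create_read_session(
    parent="projects/my_project",
    read_session=types.ReadSession(
        table="projects/my_project/datasets/my_dataset/tables/my_table",
        data_format=types.DataFormat.AVRO,
        read_options=types.ReadSession.TableReadOptions(
            row_restriction="client_id = 'c1' AND attribute_id = 'a1'",
        ),
    ),
    max_stream_count=1,
)

# Stream back only the rows matching the restriction; exporting them to
# gs://my_bucket/{client_id}/{attribute_id}/... would still be manual work.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)
```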

0 Answers