
I am currently extracting a BigQuery table into sharded .csv's in Google Cloud Storage -- is there any way to shuffle/randomize the rows for the extract? The GCS .csv's will be used as training data for a GCMLE model, and the current exports are in a non-random order as they are bunched up by similar "labels".

This causes issues when training a GCMLE model, as you must hand the model random examples of all labels within each batch. While GCMLE/TF can randomize the order of rows WITHIN individual .csv's, there is not (to my knowledge) any way to randomize the rows selected across multiple .csv's. So, I am looking for a way to ensure that the rows being output to the .csv's are indeed random.

reese0106
  • You have no control over how the data is exported from BigQuery. I'd first be asking why do you need to do this? What's the problem you're trying to solve.. – Graham Polley Oct 17 '17 at 09:54
  • Possible duplicate of [Google Cloud DataFlow Randomize WritetoBigQuery](https://stackoverflow.com/questions/46778848/google-cloud-dataflow-randomize-writetobigquery) – Graham Polley Oct 17 '17 at 09:58
  • I asked both of these questions, but I believe that they are different questions. The first was a question about a Dataflow pipeline more broadly, while this question asks about functionality within BigQuery (which could potentially be useful in the other question), rather than having the question (and answer) get lost in the comments in the event somebody else has a similar question about BigQuery. – reese0106 Oct 17 '17 at 13:52

1 Answer


Can BigQuery table extracted rows be randomized?

No. The Extract Job API (and thus any client built on top of it) has nothing that would allow you to do so.

I am looking for a way to ensure that the rows being output to the .csv are indeed random.

You should first create tables corresponding to your csv files and then extract them one by one into separate csv files. This way you can control what goes into which csv.
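The idea can be sketched in miniature (the shard count, column names, and in-memory "files" below are illustrative; in BigQuery itself you would implement the bucketing with `RAND()` in the queries that create each intermediate table, then run one extract job per table):

```python
import csv
import io
import random


def split_rows_into_shards(rows, num_shards, seed=None):
    """Randomly assign each row to one of num_shards buckets,
    mimicking a per-row RAND()-based bucket column in BigQuery."""
    rng = random.Random(seed)
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[rng.randrange(num_shards)].append(row)
    return shards


def write_shard_csvs(shards, header):
    """Write each shard to an in-memory csv, standing in for the
    per-table extract jobs that would produce the GCS objects."""
    files = []
    for shard in shards:
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)
        writer.writerows(shard)
        files.append(buf.getvalue())
    return files


# Label-clustered input, like the original export order.
rows = [("a", 0), ("b", 0), ("c", 0), ("d", 1), ("e", 1), ("f", 1)]
shards = split_rows_into_shards(rows, 3, seed=42)
csvs = write_shard_csvs(shards, ("feature", "label"))
```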

If your concern is the cost of processing (you will need to scan the table as many times as the number of csv files you need), you can check the partitioning approaches in Migrating from non-partitioned to Partitioned tables. This still involves cost, but a substantially reduced one.

Finally, a zero-cost option is to use the Tabledata.list API with paging while distributing the response rows across your csv files - you can do this easily in the client of your choice.
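A minimal local sketch of that distribution step (the paging generator stands in for successive Tabledata.list responses, and the page size and file count are illustrative; a round-robin spread is one simple way to break up label-clustered runs of rows):

```python
import itertools


def pages(rows, page_size):
    """Simulate Tabledata.list paging: yield successive pages of rows."""
    it = iter(rows)
    while True:
        page = list(itertools.islice(it, page_size))
        if not page:
            return
        yield page


def distribute(rows, num_files, page_size=100):
    """Round-robin each paged response across num_files output buckets,
    so consecutive (label-clustered) rows land in different files."""
    files = [[] for _ in range(num_files)]
    cycle = itertools.cycle(range(num_files))
    for page in pages(rows, page_size):
        for row in page:
            files[next(cycle)].append(row)
    return files


# Label-clustered input, like the original export order.
rows = [("label_a", i) for i in range(5)] + [("label_b", i) for i in range(5)]
out = distribute(rows, num_files=2, page_size=4)
```

After distribution, each output file contains rows of both labels rather than a contiguous block of one label.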

Mikhail Berlyant