
I am creating a service that allows users to apply filters on BigQuery data and export it as CSV or JSON. Is there a way I can estimate the time BigQuery will take to export a set of rows?

Currently, I am recording the number of rows and the time it took to finish the export job, and I use the average time per exported row to estimate the duration of a new job. But export time is certainly not linear in the number of rows.
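For reference, here is a minimal sketch of that linear per-row estimate (Python; the function and the shape of the job history are illustrative, not part of any BigQuery API):

```python
def estimate_seconds(row_count, history):
    """Estimate export time by averaging seconds per row over past jobs.

    history: list of (rows, seconds) tuples recorded from finished export jobs.
    """
    total_rows = sum(rows for rows, _ in history)
    total_secs = sum(secs for _, secs in history)
    return row_count * (total_secs / total_rows)
```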

Any suggestions on a prediction algorithm would be great too.

  • We do pretty much the same: we keep an average per report type and table size for the last 7 and 30 days, and we leverage the stats from those. We don't do it per row. – Pentium10 Sep 15 '14 at 09:01
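A minimal sketch of that bucketed approach, assuming you log each finished export job yourself (the bucket thresholds, field names, and fallback behaviour below are all illustrative):

```python
from datetime import datetime, timedelta
from statistics import mean

def size_bucket(row_count):
    """Coarse table-size bucket; the thresholds are arbitrary examples."""
    if row_count < 100_000:
        return "small"
    if row_count < 10_000_000:
        return "medium"
    return "large"

def estimate_seconds(report_type, row_count, history, window_days=7):
    """history: list of dicts like
    {"report_type": str, "rows": int, "seconds": float, "finished": datetime}."""
    cutoff = datetime.utcnow() - timedelta(days=window_days)
    samples = [
        h["seconds"]
        for h in history
        if h["finished"] >= cutoff
        and h["report_type"] == report_type
        and size_bucket(h["rows"]) == size_bucket(row_count)
    ]
    # Fall back to a wider window (e.g. 30 days) if there are no recent samples.
    return mean(samples) if samples else None
```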

1 Answer


Unfortunately, there isn't a great way to predict how long an export will take. A number of factors are involved:

  • How many "shards" of data your table is broken up into. This is related to how compressible your data is and, to some extent, how you've loaded your tables into BigQuery. BigQuery will attempt to do extracts in parallel as long as you pass a 'glob' path as your extract destination (e.g. gs://foo/bar/baz*.csv); see the sketch after this list.
  • The size of the table.
  • The number of concurrently running extract jobs. The higher the overall system load, the fewer resources will be available to your extract job.
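As a rough illustration, this is how an extract job with a wildcard destination might look with the google-cloud-bigquery Python client (a sketch; the project, dataset, table, and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

job_config = bigquery.ExtractJobConfig(destination_format="CSV")

extract_job = client.extract_table(
    "my_project.my_dataset.my_table",         # placeholder table
    "gs://my-bucket/exports/my_table-*.csv",  # wildcard lets BigQuery write shards in parallel
    job_config=job_config,
)
extract_job.result()  # block until the job finishes

# Record the timing so you can feed it back into your estimator.
elapsed = (extract_job.ended - extract_job.started).total_seconds()
print(f"Export took {elapsed:.1f}s")
```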

Since most of these factors aren't really under your control, the best practices are:

  • Always pass a glob path as your destination path so that BigQuery can extract in parallel.
  • If your table is small, you can use tabledata.list to extract the data instead of export (see the sketch below).
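For example, with the google-cloud-bigquery Python client, list_rows() pages through the table via tabledata.list, and you can write the rows out yourself (a sketch; the table name and output file are placeholders):

```python
import csv
from google.cloud import bigquery

client = bigquery.Client()

rows = client.list_rows("my_project.my_dataset.small_table")  # tabledata.list under the hood

with open("small_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([field.name for field in rows.schema])  # header row
    for row in rows:
        writer.writerow(row.values())
```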

Note that there are a couple of open bugs with respect to extract performance that we're working on addressing.

Jordan Tigani