
I know from this question that one can do random sampling with RAND():

    SELECT * FROM [table] WHERE RAND() < percentage

But this requires a full table scan and incurs the corresponding cost. I'm wondering if there is a more efficient way.

I'm experimenting with the tabledata.list API but got java.net.SocketTimeoutException: Read timed out when the start index is very large (e.g. > 10,000,000). Is this operation not O(1)?

    bigquery
      .tabledata()
      .list(tableRef.getProjectId, tableRef.getDatasetId, tableRef.getTableId)
      .setStartIndex(index)
      .setMaxResults(1L)
      .execute()

Neville Li

2 Answers


I would recommend paging through tabledata.list with pageToken and collecting sample rows from each page. This should scale much better.
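A minimal sketch of that approach, assuming the same `bigquery` client and `tableRef` as in the question; the page size and the `samplesPerPage` parameter are arbitrary choices for illustration:

    import com.google.api.services.bigquery.model.TableRow
    import scala.collection.JavaConverters._
    import scala.util.Random

    // Page through the table with pageToken, keeping a few random rows per page.
    // samplesPerPage is a hypothetical knob, not part of the API.
    def samplePages(samplesPerPage: Int): Seq[TableRow] = {
      val samples = Seq.newBuilder[TableRow]
      var pageToken: String = null
      do {
        val response = bigquery
          .tabledata()
          .list(tableRef.getProjectId, tableRef.getDatasetId, tableRef.getTableId)
          .setMaxResults(10000L)
          .setPageToken(pageToken) // null on the first request, so it is omitted
          .execute()
        val rows = Option(response.getRows).map(_.asScala).getOrElse(Nil)
        samples ++= Random.shuffle(rows).take(samplesPerPage)
        pageToken = response.getPageToken
      } while (pageToken != null)
      samples.result()
    }

Note this still reads every page of the table; it just avoids the large-startIndex lookups that were timing out.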

Another (totally different) option I see is the use of Table Decorators.
You can programmatically generate a random time (for a snapshot decorator) or time range (for a range decorator) in a loop and query only that portion of the data, extracting the rows you need.
Note the limitation: this only lets you sample data that is less than 7 days old.
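A rough sketch of the range-decorator variant in legacy SQL, assuming a placeholder table name and an arbitrary one-hour window:

    import scala.util.Random

    // Pick a random one-hour window within the last 7 days and query only
    // the rows added in that window, via a legacy-SQL range decorator.
    // The project/dataset/table names below are placeholders.
    val nowMs       = System.currentTimeMillis()
    val sevenDaysMs = 7L * 24 * 60 * 60 * 1000
    val windowMs    = 60L * 60 * 1000
    val start       = nowMs - sevenDaysMs +
      (Random.nextDouble() * (sevenDaysMs - windowMs)).toLong
    val end         = start + windowMs
    val query       = s"SELECT * FROM [myproject:mydataset.mytable@$start-$end] LIMIT 100"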

Mikhail Berlyant
  • My use case is to uniformly sample a small, fixed number of rows without doing full-table IO. Wouldn't the `pageToken` approach still require going through all the rows in the table? – Neville Li Jul 15 '16 at 19:50

tabledata.list is not especially performant for arbitrary lookups in a table, and it gets slower the further into the table you look. It is not really designed for efficiently retrieving an entire table; it is more for looking at the first few pages of data.

If you want to run some operation over all the data in your table, but not run a query, you should probably use an extract job to GCS instead, and sample rows from the output files.
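A minimal sketch of such an extract job, assuming the same `bigquery` client and `tableRef` as in the question; the GCS bucket path is a placeholder:

    import com.google.api.services.bigquery.model.{Job, JobConfiguration, JobConfigurationExtract}
    import java.util.Collections

    // Export the table to GCS as sharded newline-delimited JSON; rows can then
    // be sampled from the output files without querying the table.
    val extract = new JobConfigurationExtract()
      .setSourceTable(tableRef)
      .setDestinationUris(Collections.singletonList("gs://my-bucket/export/part-*.json"))
      .setDestinationFormat("NEWLINE_DELIMITED_JSON")
    val job = new Job().setConfiguration(new JobConfiguration().setExtract(extract))
    bigquery.jobs().insert(tableRef.getProjectId, job).execute()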

Danny Kitt