
The script below randomly samples an approximate number of rows (~50k).

SELECT      *
FROM        table
qualify     rand() <= 50000 / count(*) over()

This has worked a handful of times before, so I was shocked to find this error this morning:

int64 overflow: 8475548256593033885 + 6301395400903259047

I have read this post, but as I am not summing, I don't think it is applicable. The table in question has 267,606,559 rows.

Looking forward to any ideas. Thank you.

Sweepy Dodo

1 Answer


I believe a count is actually computed as a sum in the way BQ (and other databases) execute queries. You can see this by viewing the Execution Details/Graph in the BQ UI. This is true even for a simple select count(*) from table query.

For your problem, consider something simpler like:

select *, rand() as my_rand
from table
order by my_rand
limit 50000

Also, if you know the rough size of your data or don't need exactly 50K, consider using the tablesample method:

select * from table
tablesample system (10 percent)
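
If you want roughly n rows rather than a fixed percent, one option (a sketch only, using BigQuery scripting; `my_dataset.my_table` is a placeholder for your real table) is to read the row count from table metadata, which avoids a full scan, and build the percentage dynamically. Since tablesample expects a literal percentage, the final query is assembled with execute immediate:

-- Sketch: sample roughly 50k rows regardless of table size.
-- Assumes BigQuery scripting; `my_dataset.my_table` is a placeholder.
declare total_rows int64;
declare pct float64;

-- Row count from dataset metadata (no table scan).
set total_rows = (
  select row_count
  from my_dataset.__TABLES__
  where table_id = 'my_table'
);

-- Percentage that should yield ~50k rows, capped at 100.
set pct = least(100.0, 100.0 * 50000 / total_rows);

-- tablesample takes a literal, so build the query text dynamically.
execute immediate format(
  "select * from my_dataset.my_table tablesample system (%f percent)",
  pct
);

Note that tablesample system picks whole data blocks, so the result is approximate; the row count you get back can drift a bit from 50k.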
rtenha
  • While `tablesample system (x percent)` is handy, it is still sub-optimal, as I am looking to sample approximately n rows, not a percentage. There is no guarantee the x% I choose is close to the n rows I want (choosing it requires knowing the table's total row count). As for the order method, I've seen it mentioned on SO but am actively avoiding it, as it involves sorting the entire table. Thank you, but I wonder if there are yet other methods – Sweepy Dodo Dec 14 '22 at 14:58
  • Those are your main options in BQ/SQL. Why do you want to avoid ordering? Your qualify method is way more computationally intense (and prone to overflow errors to get back to the point of your question). – rtenha Dec 14 '22 at 17:17
  • Sorting is a problem with huge tables, as it requires sorting the entire table. See comments [here](https://stackoverflow.com/questions/19412/how-to-request-a-random-row-in-sql). As for why a fixed number of rows: this is to control how large the data set is, as I am feeding it into software for model building. Using a % means the size (memory required) of the data set could vary – Sweepy Dodo Dec 14 '22 at 22:21
  • That's BQs problem, not yours, IMO. You're trading one problem for another. In your original, you aren't guaranteed to get 50K, just "about 50K". My proposed solution gets you exactly 50K, if that is really a great concern. – rtenha Dec 15 '22 at 01:14
  • Thank you for your suggestion, but it doesn't achieve what I need, because this script is run daily. Each day, the number of rows in the database differs, so the % method would yield a different number of rows; on a day with a spike in rows, this would crash the local PC. I will have to modify `tablesample system (x percent)` to take x as a function of that day's total row count. Hence, I wonder if there are other methods – Sweepy Dodo Dec 15 '22 at 09:46