87

I just discovered that the RAND() function, while undocumented, works in BigQuery. I was able to generate a (seemingly) random sample of 10 words from the Shakespeare dataset using:

SELECT word FROM
(SELECT rand() as random,word FROM [publicdata:samples.shakespeare] ORDER BY random)
LIMIT 10

My question is: Are there any disadvantages to using this approach instead of the HASH() method defined in the "Advanced examples" section of the reference manual? https://developers.google.com/bigquery/query-reference

Pat Myron
David M Smith

6 Answers

112

For stratified sampling, check https://stackoverflow.com/a/52901452/132438


Good job finding it :). I requested the function recently, but it hasn't made it to documentation yet.

I would say the advantage of RAND() is that the results will vary, while HASH() will keep giving you the same results for the same values (not guaranteed over time, but you get the idea).

If you want the variability that RAND() brings while still getting consistent results, you can seed it with an integer, as in RAND(3) (legacy SQL only; standard SQL's RAND() takes no arguments).

Notice though that the example you pasted is doing a full sort of the random values - for sufficiently big inputs this approach won't scale.

A scalable approach, to get around 10 random rows:

SELECT word
FROM [publicdata:samples.shakespeare]
WHERE RAND() < 10/164656

(where 10 is the approximate number of results I want to get, and 164656 the number of rows that table has)
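This filter can be sanity-checked outside BigQuery. The minimal Python sketch below (the row count 164656 and target of 10 come from the answer above; nothing here is BigQuery-specific) simulates each row independently passing the RAND() < 10/164656 test, so the sample size is binomial with mean 10:

```python
import random

# Each of the 164,656 rows passes WHERE RAND() < 10/164656 independently
# with probability p, so the returned sample size is Binomial(n, p) with
# mean n * p = 10. Averaging a few simulated runs shows that.
n_rows = 164656
p = 10 / n_rows

trials = 20
counts = [sum(random.random() < p for _ in range(n_rows)) for _ in range(trials)]
mean_size = sum(counts) / trials
print(mean_size)  # hovers around 10; individual runs vary
```

Each individual run varies around 10 (standard deviation ≈ √10), which is why this approach gets you "around 10 random rows" rather than exactly 10.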


Standard SQL update:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/164656

or even:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/(SELECT COUNT(*) FROM `publicdata.samples.shakespeare`)
Felipe Hoffa
  • 4
    Hi Felipe! Can you explain what you are doing with `WHERE RAND() < 10/164656`? Thanks! – Rohit Apr 29 '14 at 23:40
  • 5
    RAND() gives me a random number between 0 and 1. So if I ask for a set of random numbers, only around a tenth of them will be less than 0.1, only around a hundredth of them will be less than 0.01, and so on. Shakespeare table has 164656 rows. If I want only around 1 random row, I'll give each row a random number, and pick the ones with it < 1/164656. It's not guaranteed that there will be one - but still. 10? 10*1/164656=10/164656. Try it out :). – Felipe Hoffa Apr 29 '14 at 23:49
  • @FelipeHoffa I'm trying to implement your method, but I'm getting wildly different results. I have a table of about 6 billion rows and after I do the sampling using e.g. `rand() < 0.004`, the ratio of the sample size over the total number of rows comes out as `0.0002`. In other words, my sample is about 20 times smaller than expected. Do you have an explanation for this behavior? – Kris Jan 15 '15 at 11:09
  • 1
    6 billion * 0.004 = 24 million. If I try it on a ~6 billion record table, I get >23 million rows. Works for me: SELECT COUNT(*) FROM ( SELECT requests FROM [fh-bigquery:wikipedia.pagecounts_201410] WHERE RAND() < 0.004 ) – Felipe Hoffa Jan 15 '15 at 20:40
  • 2
    @FelipeHoffa You're absolutely right. I must've made a mistake before, cause I tried it again this morning and now it is working swimmingly. Thanks! – Kris Jan 16 '15 at 10:26
  • Hi Felipe, do you have some modification of this method that leads to a totally reproducible sample? When I try this (specifying a random seed) I do not get the same sample every time. – RKD314 Aug 02 '16 at 15:57
  • This page also explains how hash function can be used and compares to rand: https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning – s6mike Nov 21 '17 at 11:24
  • would it be possible to get 100% plus a % randomly ... like WHERE RAND() <= 1.02 ? – Cristián Vargas Acevedo Feb 04 '22 at 18:23
22

Great to know RAND() is available!

In my case I needed a predefined sample size. Instead of having to know the total number of rows and divide the sample size by the total, I use the following query:

SELECT word, rand(5) as rand
FROM [publicdata:samples.shakespeare]
order by rand
#Sample size needed = 10
limit 10

Summarizing, I use ORDER BY + LIMIT to randomize and then extract a defined number of samples.
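The ORDER BY random-key + LIMIT idea can be sketched in plain Python (the word list below is a stand-in, not the real Shakespeare table): give every row a random key, sort by it, and keep the first k rows, which guarantees an exact sample size without knowing the total row count:

```python
import random

def sample_k(rows, k, seed=None):
    """Mimic ORDER BY rand LIMIT k: tag each row with a random key,
    sort by the key, and keep the first k rows."""
    rng = random.Random(seed)  # seeding makes the sample repeatable, like rand(5)
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed[:k]]

words = [f"word_{i}" for i in range(1000)]
sample = sample_k(words, 10, seed=5)
print(len(sample))  # exactly 10 rows, unlike the probabilistic RAND() < p filter
```

The trade-off, as noted in the accepted answer, is that the full sort does not scale to very large inputs.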

fernandosjp
  • 7
    Note that a 'semi-random-but-deterministic' ordering can be achieved using `ORDER BY MD5(someIdentifier)` – Melle Nov 28 '18 at 15:46
  • 1
    `order by` will lead to memory exhaustion in case you had a large table since the ordering takes place in a single node. – S. P Mar 21 '22 at 01:35
  • Note that at the time of this comment, rand does not accept any arguments, such as a seed. Right now, RAND(5) would produce an error. – geo909 Oct 09 '22 at 14:05
22

BigQuery has a table sampling option, which lets us query random subsets of data from large tables. We can also specify the percentage of rows to sample.

Example:

SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT)

SELECT *
FROM dataset.table1 T1 TABLESAMPLE SYSTEM (10 PERCENT)
JOIN dataset.table2 T2 TABLESAMPLE SYSTEM (20 PERCENT) USING (customer_id)

https://cloud.google.com/bigquery/docs/table-sampling

SANN3
10

One additional tip to make it even simpler: you can order by the function itself, i.e.:

select x from y order by rand() limit 100

=> Sample of 100

6

Once you calculate what percentage of the total you need, you have a few options.

As mentioned before, one way is non-deterministic sampling (every run gives a different sample) using RAND(). For example, if you want 0.1% of your total database sampled:

SELECT *
FROM `dataset.table`
WHERE RAND() < 0.001 

You can make this deterministic by saving the result as a table so you can query it later; you could also select just one key column and save only that for future use.


Another way to get a repeatable random sample is to use a fingerprint (hash) function to generate a fingerprint of your unique identifier column, and then select rows based on the fingerprint modulo 100. The query below labels a random sample of about 70% of the table; afterwards you can filter on in_sample = 'True':

SELECT
*,
IF(MOD(ABS(FARM_FINGERPRINT(CAST(YOUR_COLUMN AS STRING))), 100) < 70,'True', 'False') 
AS in_sample
FROM (
     SELECT
     DISTINCT(YOUR_UNIQUE_IDENTIFIER_COLUMN) AS YOUR_COLUMN
     FROM
     `dataset.table`)

If you don't have a unique identifier column, you can create one with GENERATE_UUID(), use ROW_NUMBER() OVER (), or concatenate multiple columns to make one.
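The repeatability of the fingerprint approach can be illustrated outside BigQuery. In the Python sketch below, MD5 stands in for FARM_FINGERPRINT (an assumption for illustration; they are different functions): hashing each identifier and taking the result modulo 100 deterministically puts it in or out of the 70% sample, so re-running the query selects exactly the same rows:

```python
import hashlib

def in_sample(identifier: str, percent: int = 70) -> bool:
    # MD5 stands in for FARM_FINGERPRINT here (assumption for illustration);
    # any stable hash works: same input -> same bucket on every run.
    digest = hashlib.md5(identifier.encode()).hexdigest()
    return int(digest, 16) % 100 < percent

ids = [f"user_{i}" for i in range(10000)]
sample = [i for i in ids if in_sample(i)]
print(len(sample) / len(ids))  # roughly 0.70, identical on every run
```

Unlike RAND(), nothing here depends on when or how often you run it, which is exactly the property you want for repeatable train/test splits.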


A similar approach uses the HASH() function (legacy SQL) instead. It is also repeatable and gets you 70% of your table; if you want a different percentage, just change the 7:

SELECT
*
FROM
`dataset.table`
WHERE
ABS(HASH(YOUR_COLUMN)) % 10 < 7

I don't know how the scalability of FARM_FINGERPRINT compares to HASH(), so I've mentioned both; one may work better than the other in some cases.

Best of luck,

Yev Guyduy
  • 1
    Thanks for your answer. The second option leads to: `Syntax error: Expected ")" but got keyword HASH at [7:7]`. – S. P Mar 21 '22 at 01:43
  • Also, the first option only fetches the unique identifier and `in_sample` column. What if I wanted to retrieve all the other columns too? – S. P Mar 21 '22 at 02:00
    Hello, please check the second option's syntax; I just checked it on my side and it ran. For the first option, just make a unique identifier column in your original table using, for example, ROW_NUMBER() OVER () or GENERATE_UUID(), and save whatever comes back as another table; then use that table to query the original table where the UUID is present in your sampled table – Yev Guyduy Mar 21 '22 at 13:34
2

The RAND() function in BigQuery is actually well-documented. You can check the documentation here: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#rand

This function generates a pseudo-random value greater than or equal to 0 and less than 1 for each row of a query result. While it can be used to generate a random sample, the approach has some disadvantages, such as bias and lack of reproducibility. The most troubling problem, however, may be performance: on big datasets, HASH() will usually be more efficient than RAND(). Furthermore, I recommend checking out the official feature for sampling data: https://cloud.google.com/bigquery/docs/table-sampling

It is very simple to use. For example, the following query selects 10% of the data in dataset.my_table:

SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT)
NovasVilla