
The answers to this question explain how to randomly sample from a BigQuery table. Is there an efficient way to do this with replacement?

As an example, suppose I have a table with 1M rows and I wish to draw 100K rows independently at random, with replacement.

Max
    Please explain what YOU mean by "100K independent random samples". How big is each sample? What do you want the data to look like? How are the samples represented in the result set? – Gordon Linoff Feb 26 '20 at 21:40
  • Yes, and please add some color on why would you want to do that – Felipe Hoffa Feb 26 '20 at 21:46

1 Answer


Found a neat solution:

  • Index the rows of the table
  • Generate a dummy table with 100K random integers between 1 and 1M
  • Inner join the tables on index = random value

Code:

# randomly sample 100K rows from `table` with replacement
with large_table as (
  select *, row_number() over() as rk from `table`
),
num_elements as (
  select count(1) as n from large_table
),
# 100K uniform random indices in [1, n]; floor() keeps the distribution
# uniform, whereas a bare cast (round-to-nearest) gives the endpoints
# only half the weight of interior values
dummy_table as (
  select 1 + cast(floor(rand() * (select n from num_elements)) as int64) as i
  from unnest(generate_array(1, 100000))
)
select large_table.*
from dummy_table
inner join large_table on dummy_table.i = large_table.rk
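The same index-and-join idea can be sketched outside of BigQuery. Below is a small Python illustration of the logic (the function name and sample data are hypothetical, not from the original answer): number the rows, draw k uniform random indices with replacement, then look each index up, which plays the role of the inner join.

```python
import random

def sample_with_replacement(table, k):
    # Mirror of the SQL approach:
    #   large_table -> rows numbered 1..n (the rk column)
    #   dummy_table -> k random indices drawn uniformly in [1, n]
    #   inner join  -> look up each drawn index
    n = len(table)
    indexed = {i + 1: row for i, row in enumerate(table)}
    draws = [random.randint(1, n) for _ in range(k)]
    return [indexed[i] for i in draws]

rows = [f"row{i}" for i in range(1, 1001)]
sample = sample_with_replacement(rows, 100)
```

Because the indices are drawn independently, the same row can appear more than once in the result, which is exactly what sampling with replacement means.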
Max