0

I would like to get help understanding sql query in bigquery.

If I write query as below, this samples random 1% rows out of total rows which column DEP_DELAY are not null. Without RAND() , total number of rows are 235208.

SELECT COUNT(DEP_DELAY) FROM flights.fedtzcorr WHERE RAND() < 0.1

Reading the document, it says it creates fifteen digits of precision after the decimal point. So, literaly understanding, I thought RAND() creates 15 digit number between 0 and 100,000,000,000,000 ...

This function generates values of Decimal type with fifteen digits of precision after the decimal point.

How does RAND() sample 1% of rows from total number of rows ?

Yu Watanabe
  • 621
  • 4
  • 17

2 Answers2

1

Note that the document you pointed to is documentation related to Dataprep, not BigQuery:

The correct documentation is:

Generates a pseudo-random value of type FLOAT64 in the range of [0, 1), inclusive of 0 and exclusive of 1.

So RAND() generates a number between 0 and 1. WHERE RAND() < 0.1 then matches 10% of rows (not 1%). For 1% try WHERE RAND() < 0.01.

See https://stackoverflow.com/a/23376403/132438 for more examples.

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
0

So the query could be the following:

DECLARE deep_delay_count INT64;
DECLARE deep_delay_pct FLOAT64;

SET deep_delay_count = (SELECT COUNT(DISTINCT DEP_DELAY)
                        FROM flights.fedtzcorr
                        WHERE DEP_DELAY IS NOT NULL);
SET deep_delay_pct = IF(deep_delay_count IS NOT NULL, 0.01 * deep_delay_count / deep_delay_count, 0);

SELECT DISTINCT DEP_DELAY
FROM flights.fedtzcorr
WHERE DEP_DELAY IS NOT NULL
  AND RAND() < deep_delay_pct;
OO7
  • 660
  • 4
  • 10