21

I am using BigQuery, and I need to compute the 25th, 50th, and 75th percentile of a column of a dataset.

For example, how can I get the aforementioned numbers using BigQuery and STANDARD SQL. I have looked at the PERCENT_RANK, RANK, and NTILE functions but I can't seem to crack it.

Here's some code that may guide me

Appreciate the help!

Praangrammer
  • 219
  • 1
  • 2
  • 4

3 Answers3

24

Check out APPROX_QUANTILES function in Standard SQL. If you ask for 100 quantiles - you get percentiles. So the query will look like following:

SELECT percentiles[offset(25)], percentiles[offset(50)], percentiles[offset(75)]
FROM (SELECT APPROX_QUANTILES(column, 100) percentiles FROM Table)
Mosha Pasumansky
  • 13,206
  • 5
  • 32
  • 55
  • 1
    Is there a way to get the exact percentile values? It looks like APPROX_QUANTILES's aggregation is approximate... – Hoda Oct 18 '19 at 17:00
8

In case approximate aggregation does not work for you, you might want to use the PERCENTILE_CONT function (though it will use much more memory so it might not work for huge data), e.g. the following example is from here

SELECT
  PERCENTILE_CONT(x, 0) OVER() AS min,
  PERCENTILE_CONT(x, 0.01) OVER() AS percentile1,
  PERCENTILE_CONT(x, 0.5) OVER() AS median,
  PERCENTILE_CONT(x, 0.9) OVER() AS percentile90,
  PERCENTILE_CONT(x, 1) OVER() AS max
FROM UNNEST([0, 3, NULL, 1, 2]) AS x LIMIT 1;

+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile90 | max |
+-----+-------------+--------+--------------+-----+
| 0   | 0.03        | 1.5    | 2.7          | 3   |
+-----+-------------+--------+--------------+-----+
Hoda
  • 280
  • 2
  • 5
  • For the sake of completeness, this shows to use 'PERCENTILE_CONT' as aggregate function: https://stackoverflow.com/a/45579962/7224607 – arielhasidim Oct 07 '22 at 14:02
-1

You could get a fine-grain percentile profiling using the code below.

  SELECT
  offset + 1 AS percentile,
  value AS percentile_value,
  FROM UNNEST(( 
    SELECT 
      APPROX_QUANTILES(value, 100) as percentile_value,
    FROM data_set
  )) AS percentile_value  WITH OFFSET as offset

Then you could get any percentile as needed.

Hui Zheng
  • 2,394
  • 1
  • 14
  • 18