I wrote a query that renders a bar graph in the terminal using the Postgres CLI (psql). The query is slow and inefficient, and I would like to change that.

At the base, we have a pretty simple query: we want each output row to represent one equal division of the total row count of our table. Let's say our hardcoded number of divisions is N_ROWS and our table is my_table.

Also, let's say N_ROWS equals 8.

select
    (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level 
from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)

In my case, this returns my Y-Axis of the chart as:

 level
-------
 71760
 62790
 53820
 44850
 35880
 26910
 17940
  8970
     0

You can see the issues with that query already.

Can I programmatically generate the rows from N_ROWS instead of hardcoding each value in VALUES? I also don't like how I perform a new count over my whole table for each row, obviously.
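
For the repeated count, I'm imagining hoisting it into a CTE so it is only written (and hopefully computed) once, something like this untested sketch; the hardcoded VALUES list is the part I'm more stuck on:

with total as (
    select count(id) as c from my_table
)
select total.c / 8 * (8 - num) as level
from total, (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num);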

We now need our X-Axis, and this is what I came up with:

select
    r.level,
    case
        when (
            select count(id) from my_table where created_at_utc <= '2019-01-01 00:00:00'::timestamp without time zone
        ) >= r.level then true
    end as "2019-01-01"
from (
    select
        (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level
    from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)
) as r;

Which returns our first bucket:

 level | 2019-01-01
-------+------------
 71760 |
 62790 |
 53820 |
 44850 |
 35880 |
 26910 | t
 17940 | t
  8970 | t
     0 | t

I'd rather not hardcode a CASE expression for each bucket but, of course, that's what I did. The results are what I was looking for:

 level | 2019-01-01 | 2019-02-01 | 2019-03-01 | 2019-04-01 | 2019-05-01 | 2019-06-01 | 2019-07-01 | 2019-08-01 | 2019-09-01 | 2019-10-01 | 2019-11-01 | 2019-12-01
-------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------
 71760 |            |            |            |            |            |            |            |            |            |            |            | t
 62790 |            |            |            |            | t          | t          | t          | t          | t          | t          | t          | t
 53820 |            |            |            | t          | t          | t          | t          | t          | t          | t          | t          | t
 44850 |            |            | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
 35880 |            | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
 26910 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
 17940 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
  8970 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
     0 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t

For your dynamic VALUES have a look at generate_series(). Docs: https://www.postgresql.org/docs/current/functions-srf.html – Islingre Nov 13 '19 at 20:49

1 Answer

There are certainly a few improvements we can make.

First, let's make a test table with some data:

CREATE TABLE test (id bigint, dt date);
-- Add 100,000 rows
insert into test (id) select generate_series(1, 100000, 1);
-- Spread the ids across dates from 2019-01-01 to 2019-01-11
update test set dt = '2019-01-01'::date + (id/10000)::int;
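
If you want to sanity-check the distribution, a quick aggregate should show roughly 10,000 rows per day, with only a single row landing on 2019-01-11:

select dt, count(*) from test group by dt order by dt;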

We can almost directly replace your first level-finding query with this much faster one:

   SELECT unnest(percentile_disc(
                  (
                      SELECT array_agg(x) 
                      FROM generate_series(0, 1, (1::numeric)/8) as g(x))
                  ) WITHIN GROUP (ORDER BY id)
                 ) as l
    FROM test;

   l
--------
      1
  12500
  25000
  37500
  50000
  62500
  75000
  87500
 100000
(9 rows)

Note that the first level is 1 instead of 0, but the rest should be the same.
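
If you specifically want the 0-based levels your original query produced, a single count combined with generate_series() works too. A sketch against the test table, with the number of divisions still hardcoded to 8:

WITH total AS (
    SELECT count(id) AS c FROM test
)
SELECT total.c / 8 * (8 - num) AS l
FROM total, generate_series(0, 8) AS t (num);

This scans the table only once and yields 100000, 87500, ..., 0.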

There are a few other tricks that we can employ:

  • We'll get our list of dates from generate_series
  • We can group the data in test by day (or by date_trunc('day', timestamp) if you have timestamp data) and count the ids. Using a window function over this count will give us a cumulative sum of ids per day (see the small sketch after this list).
  • We can use \crosstabview in psql to pivot the resulting query
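
To make the cumulative-count idea concrete, here is a minimal sketch of just that piece against the test table (nesting the aggregate inside the window function is allowed here):

-- cumulative count of ids per day; same idea as the counts_per_day CTE below
SELECT dt,
       sum(count(id)) OVER (ORDER BY dt) AS cum_sum
FROM test
GROUP BY dt;

Combining all three pieces:
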
WITH num_levels AS (
    SELECT 8 as num_levels
), levels as (
   SELECT unnest(percentile_disc(
                  (
                      SELECT array_agg(x) 
                      FROM num_levels
                      CROSS JOIN LATERAL generate_series(0, 1, (1::numeric)/num_levels.num_levels) as g(x))
                  ) WITHIN GROUP (ORDER BY id)
                 ) as l
    FROM test
), dates as (
  SELECT d
  FROM generate_series('2019-01-01T00:00:00'::timestamp, '2019-01-11T00:00:00'::timestamp, '1 day') as g(d)
), counts_per_day AS (
  SELECT dt, 
         sum(counts) OVER (ORDER BY dt) as cum_sum -- the cumulative count
  FROM (
    SELECT dt, 
    count(id) as counts -- The count per day
    FROM test
    GROUP BY dt
  ) sub
)
SELECT l, dt, CASE WHEN cum_sum >= l THEN true ELSE null END
FROM levels, dates
LEFT JOIN counts_per_day ON dt = d
ORDER BY l DESC, d asc
\crosstabview
   l    | 2019-01-01 | 2019-01-02 | 2019-01-03 | 2019-01-04 | 2019-01-05 | 2019-01-06 | 2019-01-07 | 2019-01-08 | 2019-01-09 | 2019-01-10 | 2019-01-11
--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------
 100000 |            |            |            |            |            |            |            |            |            |            | t
  87500 |            |            |            |            |            |            |            |            | t          | t          | t
  75000 |            |            |            |            |            |            |            | t          | t          | t          | t
  62500 |            |            |            |            |            |            | t          | t          | t          | t          | t
  50000 |            |            |            |            |            | t          | t          | t          | t          | t          | t
  37500 |            |            |            | t          | t          | t          | t          | t          | t          | t          | t
  25000 |            |            | t          | t          | t          | t          | t          | t          | t          | t          | t
  12500 |            | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
      1 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
(9 rows)

That query ran in 40ms on my laptop.

The dates could also be derived from the min and max of the dates in the test table, and the interval could be changed from 1 day depending on how many columns are desired between them.
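
For example, something along these lines would generate the date list from the data itself, and its body could replace the dates CTE above (a sketch; the 1 day interval is still hardcoded):

SELECT g.d
FROM (SELECT min(dt) AS min_dt, max(dt) AS max_dt FROM test) b
CROSS JOIN LATERAL generate_series(b.min_dt::timestamp, b.max_dt::timestamp, '1 day') AS g(d);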
