
I have a table in PostgreSQL 13 that looks like this (modified for the purpose of this question):

SELECT * FROM visits.visitors_log;

   visitor_id |          day           |  source
--------------+------------------------+----------
            9 | 2019-12-30 12:10:10-05 | Twitter
            7 | 2019-12-14 22:10:26-04 | Netflix
            5 | 2019-12-13 15:21:04-05 | Netflix
            9 | 2019-12-22 23:34:47-05 | Twitter
            7 | 2019-12-22 00:10:26-04 | Netflix
            9 | 2019-12-22 13:20:42-04 | Twitter

After converting the times to another timezone, I want to calculate the percentage of visits on 2019-12-22 that came from a specific source.
There are 4 steps involved:

  1. Convert timezones
  2. Calculate how many total visits happened on that day
  3. Calculate how many total visits happened on that day that came from source Netflix
  4. Divide those 2 numbers to get the percentage.

I wrote this code which works but seems repetitive and not very clean:

SELECT (SELECT COUNT(*) FROM (SELECT visitor_id, source, day AT TIME ZONE 'PST' FROM visits.visitors_log WHERE day::date = '2019-12-22') AS a
        WHERE day::date = '2019-12-22' AND source = 'Netflix') * 100.0
         /
       (SELECT COUNT(*) FROM (SELECT visitor_id, source, day AT TIME ZONE 'PST' FROM visits.visitors_log WHERE day::date = '2019-12-22') AS b
        WHERE day::date = '2019-12-22')
   AS visitors_percentage;

Can anyone suggest a neater way of answering this question?

Erwin Brandstetter
alphaomega

2 Answers

Use an aggregate FILTER clause:

SELECT count(*) FILTER (WHERE source = 'Netflix') * 100.0
     / count(*) AS visitors_percentage
FROM   visits.visitors_log
WHERE  day >= timestamp '2019-12-22' AT TIME ZONE 'PST'
AND    day <  timestamp '2019-12-23' AT TIME ZONE 'PST';
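
To see what the FILTER clause does in isolation, here is a minimal sketch on inline data (the VALUES list is made up for illustration and is not the asker's table). It also shows the older CASE spelling, which counts the same rows:

```sql
-- Made-up inline rows, only to illustrate FILTER; not the real visitors_log data.
SELECT count(*)                                       AS total,         -- all rows
       count(*) FILTER (WHERE source = 'Netflix')     AS netflix,       -- only matching rows
       count(CASE WHEN source = 'Netflix' THEN 1 END) AS netflix_case   -- same count, pre-FILTER style
FROM  (VALUES ('Netflix'), ('Twitter'), ('Twitter')) AS t(source);
-- total = 3, netflix = 1, netflix_case = 1
```

Both aggregates are computed in a single pass over the same rows, which is why the two subqueries from the question are no longer needed.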

I rephrased the WHERE condition so it is "sargable" and can use an index on (day). A predicate with an expression on the column cannot use a plain index. So I moved the computation of inclusive lower and exclusive upper bound (day boundaries for the given time zone) to the right side of the expressions in the WHERE clause.
Makes a huge difference for performance with big tables.
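
A rough way to check this yourself, assuming a plain btree index on (day) exists (the index name below is hypothetical): first look at what the two bounds evaluate to, then compare the plans of both predicate styles.

```sql
-- Hypothetical index name; the point only requires some plain index on (day).
CREATE INDEX visitors_log_day_idx ON visits.visitors_log (day);

-- The bounds are constants, computed once. Shown in UTC for readability;
-- the setting only affects how timestamptz values are displayed.
SET TimeZone = 'UTC';
SELECT timestamp '2019-12-22' AT TIME ZONE 'PST' AS lower_bound,
       timestamp '2019-12-23' AT TIME ZONE 'PST' AS upper_bound;
-- lower_bound = 2019-12-22 08:00:00+00, upper_bound = 2019-12-23 08:00:00+00

-- Expression on the column: a plain index on (day) cannot be used for this predicate.
EXPLAIN
SELECT count(*)
FROM   visits.visitors_log
WHERE  (day AT TIME ZONE 'PST')::date = date '2019-12-22';

-- Bare column compared to constants: the index on (day) is applicable.
EXPLAIN
SELECT count(*)
FROM   visits.visitors_log
WHERE  day >= timestamp '2019-12-22' AT TIME ZONE 'PST'
AND    day <  timestamp '2019-12-23' AT TIME ZONE 'PST';
```

On a tiny table the planner may well choose a sequential scan for both queries; the difference shows up once the table is large enough for an index scan to pay off.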

If you use that query a lot, consider creating a function for it:

CREATE OR REPLACE FUNCTION my_func(_source text, _day date, _tz text)
  RETURNS numeric
  LANGUAGE sql STABLE PARALLEL SAFE AS  -- STABLE, not IMMUTABLE: the function reads from a table
$func$
SELECT round(count(*) FILTER (WHERE source = _source) * 100.0 / count(*), 2) AS visitors_percentage
FROM   visits.visitors_log
WHERE  day >= _day::timestamp AT TIME ZONE _tz
AND    day < (_day + 1)::timestamp AT TIME ZONE _tz;
$func$;

Call:

SELECT my_func('Netflix', '2019-12-22', 'PST');

I threw in round(), which is a totally optional addition.
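
One edge case worth knowing about: if no rows fall on the given day, count(*) is 0 and the division raises a "division by zero" error. A possible guard, sketched here as a variation on the query above and not part of the original answer, is to wrap the divisor in NULLIF so the result is NULL instead:

```sql
-- Variation on the query above: returns NULL instead of raising
-- "division by zero" when no visits fall on the given day.
SELECT round(count(*) FILTER (WHERE source = 'Netflix') * 100.0
           / NULLIF(count(*), 0), 2) AS visitors_percentage
FROM   visits.visitors_log
WHERE  day >= timestamp '2019-12-22' AT TIME ZONE 'PST'
AND    day <  timestamp '2019-12-23' AT TIME ZONE 'PST';
```

Whether NULL or an error is the better outcome for an empty day depends on the caller.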

Aside: "day" is a rather misleading name for a timestamp with time zone column.

Erwin Brandstetter

Hmmm . . . You can use window functions to calculate the total:

SELECT source,
       COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS visitors_percentage  -- * 100.0 turns the ratio into a percentage
FROM visits.visitors_log
WHERE (day AT TIME ZONE 'PST')::date = '2019-12-22'
GROUP BY source;
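
To see how the window function sits on top of the aggregate, here is a minimal sketch on made-up inline rows (not the asker's table). COUNT(*) is computed per group first; SUM(COUNT(*)) OVER () then adds up those per-group counts across all result rows:

```sql
-- Made-up inline rows: 1 Netflix visit and 3 Twitter visits.
SELECT source,
       count(*)                                 AS visits,      -- per-source count
       sum(count(*)) OVER ()                    AS all_visits,  -- total across all sources (4)
       count(*) * 100.0 / sum(count(*)) OVER () AS visitors_percentage
FROM  (VALUES ('Netflix'), ('Twitter'), ('Twitter'), ('Twitter')) AS t(source)
GROUP  BY source;
-- Netflix comes out at 25 percent, Twitter at 75 percent.
```

Unlike the FILTER version, this returns one row per source, so it answers the question for every source at once.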
Gordon Linoff