
So I need to use PostgreSQL, and I want the change in COUNT(DISTINCT userid) between each day and the day 7 days prior, expressed as a percentage.

Is this even possible?

Getting the distinct users by day is fairly trivial:

SELECT COUNT(DISTINCT userid), timestamp::date 
FROM logs
GROUP BY timestamp::date
ORDER BY timestamp::date DESC

How do I convert that into percentages comparing each day to the day 7 days prior?
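For reference, a minimal sketch of the table shape these queries assume (column names as used above; the types are assumptions):

CREATE TABLE logs (
    userid    integer,      -- assumed type; any type with equality works
    timestamp timestamptz   -- assumed type; only the ::date cast matters here
);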

Kevin Meyer
  • Do you want to find the percentage of users active in just the last seven days? Or the percentage of users active per week, since logs began? – Cathy Aug 01 '13 at 02:38
  • The percentage of users active today vs. 7 days ago for that specific day. – Kevin Meyer Aug 01 '13 at 18:28

2 Answers


So we need to take one value for day X, take a second value for day X - 7, and calculate the percentage change.
The query may look like:

SELECT a.timestamp, 
       a.cnt, 
       b.cnt cnt_minus_7_day, 
       round( 100.0 *( a.cnt - b.cnt ) / b.cnt , 2 ) change_7_days
from (
    SELECT timestamp::date, COUNT(DISTINCT userid)  cnt
    FROM logs
    GROUP BY timestamp::date
    ORDER BY timestamp::date 
) a
left join (
    SELECT timestamp::date, COUNT(DISTINCT userid)  cnt
    FROM logs
    GROUP BY timestamp::date
    ORDER BY timestamp::date 
) b
ON a.timestamp = b.timestamp + 7   -- b holds the count from 7 days earlier
;
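(Sanity check on the join condition: in PostgreSQL, adding an integer to a date counts whole days, so b.timestamp + 7 means "7 days after b", i.e. b is 7 days before a.)

SELECT date '2013-08-01' + 7;   -- 2013-08-08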


You can also try another version - this one should be faster,
because it seems that PostgreSQL is not smart enough to notice that the two subqueries are identical:
it evaluates the same subquery twice instead of caching the result in memory or a temp table.
A WITH clause helps to avoid this (compare the plans below).

with src as (
    SELECT timestamp::date, COUNT(DISTINCT userid)  cnt
    FROM logs
    GROUP BY timestamp::date
    ORDER BY timestamp::date 
)
SELECT a.timestamp, 
       a.cnt, 
       b.cnt cnt_minus_7_day, 
       round( 100.0 *( a.cnt - b.cnt ) / b.cnt , 2 ) change_7_days
FROM src a
left join src b
on a.timestamp = b.timestamp + 7

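Side note for readers on newer versions: since PostgreSQL 12 the planner may inline a CTE that is referenced only once; a CTE referenced twice, as here, is still materialized by default. You can also request the single evaluation explicitly:

with src as materialized (   -- MATERIALIZED requires PostgreSQL 12 or later
    SELECT timestamp::date, COUNT(DISTINCT userid)  cnt
    FROM logs
    GROUP BY timestamp::date
)
-- ... rest of the query unchanged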

Here is the plan for the first query (run against my sample data):

"Hash Left Join  (cost=5136.71..5350.93 rows=101 width=20) (actual time=77.778..88.676 rows=101 loops=1)"
"  Hash Cond: (public.logs."timestamp" = (b."timestamp" - 7))"
"  ->  GroupAggregate  (cost=2462.13..2672.31 rows=101 width=8) (actual time=44.398..55.129 rows=101 loops=1)"
"        ->  Sort  (cost=2462.13..2531.85 rows=27889 width=8) (actual time=44.290..48.392 rows=27889 loops=1)"
"              Sort Key: public.logs."timestamp""
"              Sort Method: external merge  Disk: 488kB"
"              ->  Seq Scan on logs  (cost=0.00..402.89 rows=27889 width=8) (actual time=0.037..10.396 rows=27889 loops=1)"
"  ->  Hash  (cost=2673.32..2673.32 rows=101 width=12) (actual time=33.355..33.355 rows=101 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 5kB"
"        ->  Subquery Scan on b  (cost=2462.13..2673.32 rows=101 width=12) (actual time=22.883..33.306 rows=101 loops=1)"
"              ->  GroupAggregate  (cost=2462.13..2672.31 rows=101 width=8) (actual time=22.881..33.288 rows=101 loops=1)"
"                    ->  Sort  (cost=2462.13..2531.85 rows=27889 width=8) (actual time=22.817..26.507 rows=27889 loops=1)"
"                          Sort Key: public.logs."timestamp""
"                          Sort Method: external merge  Disk: 488kB"
"                          ->  Seq Scan on logs  (cost=0.00..402.89 rows=27889 width=8) (actual time=0.014..3.696 rows=27889 loops=1)"
"Total runtime: 100.360 ms"

and for the second version:

"Hash Left Join  (cost=2675.59..2680.64 rows=101 width=20) (actual time=60.612..60.785 rows=101 loops=1)"
"  Hash Cond: (a."timestamp" = (b."timestamp" - 7))"
"  CTE src"
"    ->  GroupAggregate  (cost=2462.13..2672.31 rows=101 width=8) (actual time=46.498..60.425 rows=101 loops=1)"
"          ->  Sort  (cost=2462.13..2531.85 rows=27889 width=8) (actual time=46.382..51.113 rows=27889 loops=1)"
"                Sort Key: logs."timestamp""
"                Sort Method: external merge  Disk: 488kB"
"                ->  Seq Scan on logs  (cost=0.00..402.89 rows=27889 width=8) (actual time=0.037..8.945 rows=27889 loops=1)"
"  ->  CTE Scan on src a  (cost=0.00..2.02 rows=101 width=12) (actual time=46.504..46.518 rows=101 loops=1)"
"  ->  Hash  (cost=2.02..2.02 rows=101 width=12) (actual time=14.084..14.084 rows=101 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 5kB"
"        ->  CTE Scan on src b  (cost=0.00..2.02 rows=101 width=12) (actual time=0.002..14.033 rows=101 loops=1)"
"Total runtime: 67.799 ms"
krokodilko

You don't actually need a subquery or a CTE. You can do it with a single SELECT using the window function lag():

I use ts as the column name instead of timestamp, because it's unwise to use reserved words (SQL standard) or Postgres function / type names as identifiers.

SELECT ts::date
      ,     ((count(DISTINCT userid) * 10000)
        / lag(count(DISTINCT userid), 7) OVER (ORDER BY ts::date))::real
        / 100 - 100 AS pct_change_since_7_days_ago
      ,count(DISTINCT userid) AS ct
      ,lag(count(DISTINCT userid), 7) OVER (ORDER BY ts::date) AS ct_7_days_ago
FROM   logs
GROUP  BY 1
ORDER  BY 1 DESC;
  • I arranged the computation of the percentage this way for performance. We get a precision of 2 fractional digits without using the function round(), which would also require a cast to numeric.

  • Window functions can be applied to aggregate functions at the same query level; that's why lag(count(DISTINCT userid), 7) OVER (ORDER BY ts::date) works. The window functions lead() and lag() take additional parameters; the second parameter is the offset, so I pick the row 7 rows before the current row in the sort order.

  • Note: this requires at least one row per day, or it will miscalculate. If there can be gaps, the second query from @krokodilko would be my choice, just without the ORDER BY in the CTE, which should be applied in the outer query instead.
    Or you could create a list of days with generate_series() and LEFT JOIN to it (a sketch follows after this list), as demonstrated here:
    Retrieving row count and returning 0 when no rows
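A minimal sketch of that gap-filling approach (the date range is hypothetical; in practice derive it from min(ts) and max(ts)):

WITH daily AS (
   SELECT ts::date AS day, count(DISTINCT userid) AS ct
   FROM   logs
   GROUP  BY 1
   )
SELECT d.day::date AS day
      ,daily.ct
      ,round(100.0 * (daily.ct - lag(daily.ct, 7) OVER (ORDER BY d.day))
           / NULLIF(lag(daily.ct, 7) OVER (ORDER BY d.day), 0), 2) AS pct_change
FROM   generate_series(timestamp '2013-07-01'
                     , timestamp '2013-08-01'
                     , interval '1 day') d(day)
LEFT   JOIN daily ON daily.day = d.day::date
ORDER  BY d.day DESC;

With every calendar day present, lag(ct, 7) is guaranteed to be the count from exactly 7 days earlier, and missing days show up as NULL rather than shifting the offset.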

Division by 0

If no row exists for "7 days earlier", the result is NULL - for the LEFT JOIN in @krokodilko's version as well as for lag() - which represents reality well ("count is unknown") and serves as automatic protection against division by 0.

However, if userid can be NULL, a division by 0 becomes possible and we need to catch the case. Why this paradoxical effect?

  • Unlike other aggregate functions, count() never returns NULL. Instead NULL values are just not counted.

  • But if no rows are found for "7 days ago", we get NULL for the count, because the whole expression is NULL: count() isn't even executed - which happens to work perfectly for us in this case.

  • However, if one or more rows are found, but with userid IS NULL, we get a count of 0, which would raise an exception for the division by 0.
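A quick illustration of the first and third points:

SELECT count(userid) AS ct             -- ct = 1: the two NULLs are not counted
FROM  (VALUES (1), (NULL), (NULL)) t(userid);

SELECT count(userid) AS ct             -- ct = 0, not NULL, although all values are NULL
FROM  (VALUES (NULL::int)) t(userid);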

For a plain count(userid) we could use count(*) instead to prevent the case - though count(*) may or may not return the count you are looking for. That substitution is just not possible for count(DISTINCT userid).

Use NULLIF(count(DISTINCT userid), 0) in this case.
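Applied to the query above, a minimal sketch (only the divisor changes):

SELECT ts::date
      ,     ((count(DISTINCT userid) * 10000)
        / lag(NULLIF(count(DISTINCT userid), 0), 7) OVER (ORDER BY ts::date))::real
        / 100 - 100 AS pct_change_since_7_days_ago
      ,count(DISTINCT userid) AS ct
      ,lag(count(DISTINCT userid), 7) OVER (ORDER BY ts::date) AS ct_7_days_ago
FROM   logs
GROUP  BY 1
ORDER  BY 1 DESC;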

Erwin Brandstetter