1

I have a table with influencers and their follower counter for each day:

influencer_id |     date     |    followers
     1        | 2020-05-29   |      7361
     1        | 2020-05-28   |      7234
                    ...
     2        | 2020-05-29   |       82
     2        | 2020-05-28   |       85
                    ...
     3        | 2020-05-29   |      3434
     3        | 2020-05-28   |      2988
     3        | 2020-05-27   |      2765
                    ...

Let's say I want to calculate how many followers each individual influencer has gained in the last 7 days and get the following table:

influencer_id |                       growth
     1        |  <num followers last day - num followers first day>
     2        |                         "
     3        |                         "

As a first attempt I did this:

SELECT influencer_id,
      (MAX(followers) - MIN(followers)) AS growth
FROM influencer_follower_daily
WHERE date < '2020-05-30'
AND date >= '2020-05-23'
GROUP BY influencer_id;

This works and shows the growth over the week for each influencer. But it assumes the follower count always increases and people never unfollow!

So is there a way to achieve what I want using an SQL query over the original table? Or will I have to generate a completely new table using a FOR loop that calculates a +/- follower change column between each date?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
ddriver1
  • 723
  • 10
  • 18

2 Answers2

2

The simple aggregate functions first() and last() are not implemented in standard Postgres. But see below.

1. array_agg()

Gordon demonstrated a query with array_agg(), but that's more expensive than necessary, especially with many rows per group. Even more so when called twice, and with ORDER BY per aggregate. This equivalent alternative should be substantially faster:

SELECT influencer_id, arr[array_upper(arr, 1)] - arr[1]
FROM  (
   SELECT influencer_id, array_agg(followers) AS arr
   FROM  (
      SELECT influencer_id, followers
      FROM   influencer_follower_daily
      WHERE  date >= '2020-05-23'
      AND    date <  '2020-05-30'
      ORDER  BY influencer_id, date
      ) sub1
   GROUP  BY influencer_id
   ) sub2;

Because it sorts once and aggregates once. The sort order of the inner subquery sub1 is carried over to the next level. See:

Indexes matter:

  • If you query the whole table or most of it, an index on (influencer_id, date, followers) can help (a lot) with index-only scans.

  • If you query only a small fragment of the table, an index on (date) or (date, influencer_id, followers) can help (a lot).

2. DISTINCT & window functions

Gordon also demonstrated DISTINCT with window functions. Again, can be substantially faster:

SELECT DISTINCT ON (influencer_id)
       influencer_id
     , last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
     - followers AS growth
FROM   influencer_follower_daily
WHERE  date >= '2020-05-23'
AND    date <  '2020-05-30'
ORDER  BY influencer_id, date;

With a single window function, using the same sort order (!) as the main query. To achieve this, we need the non-default window definition with ROWS BETWEEN ... See:

And DISTINCT ON instead of DISTINCT. See:

3. Custom aggregate functions

first() and last()

You can add those yourself, it's pretty simple. See instructions in the Postgres Wiki.
Or install the additional module first_last_agg with a faster implementation in C.

Related:

Then your query becomes simpler:

SELECT influencer_id, last(followers) - first(followers) AS growth
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily 
   WHERE  date >= '2020-03-02'
   AND    date <  '2020-05-09'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;

Custom aggregate growth()

You can combine first() and last() in a single aggregate function. That's faster, but calling two C functions will still outperform one custom SQL function.

Basically encapsulates the logic of my first query in a custom aggregate:

CREATE OR REPLACE FUNCTION f_growth(anyarray)
  RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1[array_upper($1, 1)] - $1[1]';

CREATE OR REPLACE AGGREGATE growth(anyelement) (
   SFUNC     = array_append
 , STYPE     = anyarray
 , FINALFUNC = f_growth
 , PARALLEL  = SAFE
);

Works for any numeric type (or any type with an operator type - type returning the same type). The query is simpler, yet:

SELECT influencer_id, growth(followers)
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily 
   WHERE  date >= '2020-05-23'
   AND    date <  '2020-05-30'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;

Or a little slower, but ultimately short:

SELECT influencer_id, growth(followers ORDER BY date)
FROM   influencer_follower_daily 
WHERE  date >= '2020-05-23'
AND    date <  '2020-05-30'
GROUP  BY 1
ORDER  BY 1;

db<>fiddle here

4. Performance optimization for many rows per group

With many rows per group / partition, other query techniques can be (a lot) faster. Techniques along these lines:

If that applies, I suggest you start a new question disclosing exact table definition(s) and cardinalities ...


Closely related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • @ddriver1: I would be interested in the relative performance for the provided queries. Could you run `EXPLAIN (ANALYZE, TIMING OFF)` with each (best of 5 to level caching artifacts) and add an answer? (Plus table & index definition and cardinalities if that's not too much to ask.) – Erwin Brandstetter Jun 03 '20 at 02:06
  • 1
    Many thanks for the incredibly detailed answer. The table has indexes on columns ID and influencer_id, which is a FK to a separate influencer table. The other columns are followers and day. I did a best of 10 for each of the queries you provided. They all produced the correct result and sorted 66293 rows at some point. – ddriver1 Jun 05 '20 at 21:08
  • 1
    Here are the results: 1. array_agg(): 103.2ms 2. Distinct+window: 137.1ms 3. first() and last() using portable SQL (not C): 256.3ms 4. custom aggregate growth(): 166.8ms (5. Gordon's distinct+window: 284.4ms) – ddriver1 Jun 05 '20 at 21:09
  • @ddriver1: Thanks for coming back with test results! Much like I expected. – Erwin Brandstetter Jun 06 '20 at 01:45
1

Postgres doesn't have a first()/last() aggregation function. One method is:

SELECT DISTINCT influencer_id,
       ( FIRST_VALUE(followers) OVER (PARTITION BY influencer_id ORDER BY DATE DESC) -
         FIRST_VALUE(followers) OVER (PARTITION BY influencer_id ORDER BY DATE ASC)
       ) as growth
FROM influencer_follower_daily
WHERE date < '2020-05-30' AND date >= '2020-05-23';

Another alternative is to use arrays:

SELECT influencer_id,
       ( ARRAY_AGG(followers ORDER BY DATE DESC) )[1] -
         ARRAY_AGG(followers ORDER BY DATE ASC) )[1]
       ) as growth
FROM influencer_follower_daily
WHERE date < '2020-05-30' AND date >= '2020-05-23'
GROUP BY influencer_id;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • This works! By the way, you added an accidental minus (-) symbol after DATE ASC) in the first query. I tried to edit it but the edit is too small. – ddriver1 Jun 02 '20 at 16:25