Calculating follower growth over time for each influencer

Question

I have a table with influencers and their follower counter for each day:

influencer_id |     date     |    followers
     1        | 2020-05-29   |      7361
     1        | 2020-05-28   |      7234
                    ...
     2        | 2020-05-29   |       82
     2        | 2020-05-28   |       85
                    ...
     3        | 2020-05-29   |      3434
     3        | 2020-05-28   |      2988
     3        | 2020-05-27   |      2765
                    ...

Let's say I want to calculate how many followers each individual influencer has gained in the last 7 days and get the following table:

influencer_id |                       growth
     1        |  <num followers last day - num followers first day>
     2        |                         "
     3        |                         "

As a first attempt I did this:

SELECT influencer_id,
      (MAX(followers) - MIN(followers)) AS growth
FROM influencer_follower_daily
WHERE date < '2020-05-30'
AND date >= '2020-05-23'
GROUP BY influencer_id;

This works and shows the growth over the week for each influencer. But it assumes the follower count always increases and people never unfollow!

So is there a way to achieve what I want using an SQL query over the original table? Or will I have to generate a completely new table using a FOR loop that calculates a +/- follower change column between each date?

score 2 · Accepted Answer · answered Jun 03 '20 at 02:02

The simple aggregate functions first() and last() are not implemented in standard Postgres. But see below.

1. `array_agg()`

Gordon demonstrated a query with array_agg(), but that's more expensive than necessary, especially with many rows per group. Even more so when called twice, and with ORDER BY per aggregate. This equivalent alternative should be substantially faster:

SELECT influencer_id, arr[array_upper(arr, 1)] - arr[1]
FROM  (
   SELECT influencer_id, array_agg(followers) AS arr
   FROM  (
      SELECT influencer_id, followers
      FROM   influencer_follower_daily
      WHERE  date >= '2020-05-23'
      AND    date <  '2020-05-30'
      ORDER  BY influencer_id, date
      ) sub1
   GROUP  BY influencer_id
   ) sub2;

Because it sorts once and aggregates once. The sort order of the inner subquery sub1 is carried over to the next level. See:

How to apply ORDER BY and LIMIT in combination with an aggregate function?

Indexes matter:

If you query the whole table or most of it, an index on (influencer_id, date, followers) can help (a lot) with index-only scans.
If you query only a small fragment of the table, an index on (date) or (date, influencer_id, followers) can help (a lot).

2. `DISTINCT` & window functions

Gordon also demonstrated DISTINCT with window functions. Again, can be substantially faster:

SELECT DISTINCT ON (influencer_id)
       influencer_id
     , last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
     - followers AS growth
FROM   influencer_follower_daily
WHERE  date >= '2020-05-23'
AND    date <  '2020-05-30'
ORDER  BY influencer_id, date;

With a single window function, using the same sort order (!) as the main query. To achieve this, we need the non-default window definition with ROWS BETWEEN ... See:

PostgreSQL query with max and min date plus associated id per row

And DISTINCT ON instead of DISTINCT. See:

Select first row in each GROUP BY group?

3. Custom aggregate functions

`first()` and `last()`

You can add those yourself, it's pretty simple. See instructions in the Postgres Wiki.
Or install the additional module first_last_agg with a faster implementation in C.

Use something like TOP with GROUP BY

Then your query becomes simpler:

SELECT influencer_id, last(followers) - first(followers) AS growth
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily 
   WHERE  date >= '2020-03-02'
   AND    date <  '2020-05-09'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;

Custom aggregate `growth()`

You can combine first() and last() in a single aggregate function. That's faster, but calling two C functions will still outperform one custom SQL function.

Basically encapsulates the logic of my first query in a custom aggregate:

CREATE OR REPLACE FUNCTION f_growth(anyarray)
  RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1[array_upper($1, 1)] - $1[1]';

CREATE OR REPLACE AGGREGATE growth(anyelement) (
   SFUNC     = array_append
 , STYPE     = anyarray
 , FINALFUNC = f_growth
 , PARALLEL  = SAFE
);

Works for any numeric type (or any type with an operator type - type returning the same type). The query is simpler, yet:

SELECT influencer_id, growth(followers)
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily 
   WHERE  date >= '2020-05-23'
   AND    date <  '2020-05-30'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;

Or a little slower, but ultimately short:

SELECT influencer_id, growth(followers ORDER BY date)
FROM   influencer_follower_daily 
WHERE  date >= '2020-05-23'
AND    date <  '2020-05-30'
GROUP  BY 1
ORDER  BY 1;

db<>fiddle here

4. Performance optimization for many rows per group

With many rows per group / partition, other query techniques can be (a lot) faster. Techniques along these lines:

Optimize GROUP BY query to retrieve latest row per user

If that applies, I suggest you start a new question disclosing exact table definition(s) and cardinalities ...

Closely related:

@ddriver1: I would be interested in the relative performance for the provided queries. Could you run `EXPLAIN (ANALYZE, TIMING OFF)` with each (best of 5 to level caching artifacts) and add an answer? (Plus table & index definition and cardinalities if that's not too much to ask.) — Erwin Brandstetter, Jun 03 '20 at 02:06
Many thanks for the incredibly detailed answer. The table has indexes on columns ID and influencer_id, which is a FK to a separate influencer table. The other columns are followers and day. I did a best of 10 for each of the queries you provided. They all produced the correct result and sorted 66293 rows at some point. — ddriver1, Jun 05 '20 at 21:08
Here are the results: 1. array_agg(): 103.2ms 2. Distinct+window: 137.1ms 3. first() and last() using portable SQL (not C): 256.3ms 4. custom aggregate growth(): 166.8ms (5. Gordon's distinct+window: 284.4ms) — ddriver1, Jun 05 '20 at 21:09
@ddriver1: Thanks for coming back with test results! Much like I expected. — Erwin Brandstetter, Jun 06 '20 at 01:45

Gordon Linoff · Answer 2 · 2020-06-02T17:06:10.847

Postgres doesn't have a first()/last() aggregation function. One method is:

SELECT DISTINCT influencer_id,
       ( FIRST_VALUE(followers) OVER (PARTITION BY influencer_id ORDER BY DATE DESC) -
         FIRST_VALUE(followers) OVER (PARTITION BY influencer_id ORDER BY DATE ASC)
       ) as growth
FROM influencer_follower_daily
WHERE date < '2020-05-30' AND date >= '2020-05-23';

Another alternative is to use arrays:

SELECT influencer_id,
       ( ARRAY_AGG(followers ORDER BY DATE DESC) )[1] -
         ARRAY_AGG(followers ORDER BY DATE ASC) )[1]
       ) as growth
FROM influencer_follower_daily
WHERE date < '2020-05-30' AND date >= '2020-05-23'
GROUP BY influencer_id;

This works! By the way, you added an accidental minus (-) symbol after DATE ASC) in the first query. I tried to edit it but the edit is too small. — ddriver1, Jun 02 '20 at 16:25

Calculating follower growth over time for each influencer

2 Answers2

1. `array_agg()`

2. `DISTINCT` & window functions

3. Custom aggregate functions

`first()` and `last()`

Custom aggregate `growth()`

4. Performance optimization for many rows per group

Linked

Calculating follower growth over time for each influencer

2 Answers2

1. array_agg()

2. DISTINCT & window functions

3. Custom aggregate functions

first() and last()

Custom aggregate growth()

4. Performance optimization for many rows per group

Linked

1. `array_agg()`

2. `DISTINCT` & window functions

`first()` and `last()`

Custom aggregate `growth()`