The simple aggregate functions first()
and last()
are not implemented in standard Postgres. But see below.
1. array_agg()
Gordon demonstrated a query with array_agg()
, but that's more expensive than necessary, especially with many rows per group. Even more so when called twice, and with ORDER BY
per aggregate. This equivalent alternative should be substantially faster:
SELECT influencer_id, arr[array_upper(arr, 1)] - arr[1]
FROM (
SELECT influencer_id, array_agg(followers) AS arr
FROM (
SELECT influencer_id, followers
FROM influencer_follower_daily
WHERE date >= '2020-05-23'
AND date < '2020-05-30'
ORDER BY influencer_id, date
) sub1
GROUP BY influencer_id
) sub2;
Because it sorts once and aggregates once. The sort order of the inner subquery sub1
is carried over to the next level. See:
Indexes matter:
If you query the whole table or most of it, an index on (influencer_id, date, followers)
can help (a lot) with index-only scans.
If you query only a small fragment of the table, an index on (date)
or (date, influencer_id, followers)
can help (a lot).
2. DISTINCT
& window functions
Gordon also demonstrated DISTINCT
with window functions. Again, can be substantially faster:
SELECT DISTINCT ON (influencer_id)
influencer_id
, last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
- followers AS growth
FROM influencer_follower_daily
WHERE date >= '2020-05-23'
AND date < '2020-05-30'
ORDER BY influencer_id, date;
With a single window function, using the same sort order (!) as the main query. To achieve this, we need the non-default window definition with ROWS BETWEEN ...
See:
And DISTINCT ON
instead of DISTINCT
. See:
3. Custom aggregate functions
first()
and last()
You can add those yourself, it's pretty simple. See instructions in the Postgres Wiki.
Or install the additional module first_last_agg
with a faster implementation in C.
Related:
Then your query becomes simpler:
SELECT influencer_id, last(followers) - first(followers) AS growth
FROM (
SELECT influencer_id, followers
FROM influencer_follower_daily
WHERE date >= '2020-03-02'
AND date < '2020-05-09'
ORDER BY influencer_id, date
) z
GROUP BY influencer_id
ORDER BY influencer_id;
Custom aggregate growth()
You can combine first()
and last()
in a single aggregate function. That's faster, but calling two C functions will still outperform one custom SQL function.
Basically encapsulates the logic of my first query in a custom aggregate:
CREATE OR REPLACE FUNCTION f_growth(anyarray)
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1[array_upper($1, 1)] - $1[1]';
CREATE OR REPLACE AGGREGATE growth(anyelement) (
SFUNC = array_append
, STYPE = anyarray
, FINALFUNC = f_growth
, PARALLEL = SAFE
);
Works for any numeric type (or any type with an operator type - type
returning the same type). The query is simpler, yet:
SELECT influencer_id, growth(followers)
FROM (
SELECT influencer_id, followers
FROM influencer_follower_daily
WHERE date >= '2020-05-23'
AND date < '2020-05-30'
ORDER BY influencer_id, date
) z
GROUP BY influencer_id
ORDER BY influencer_id;
Or a little slower, but ultimately short:
SELECT influencer_id, growth(followers ORDER BY date)
FROM influencer_follower_daily
WHERE date >= '2020-05-23'
AND date < '2020-05-30'
GROUP BY 1
ORDER BY 1;
db<>fiddle here
4. Performance optimization for many rows per group
With many rows per group / partition, other query techniques can be (a lot) faster. Techniques along these lines:
If that applies, I suggest you start a new question disclosing exact table definition(s) and cardinalities ...
Closely related: