1

I have a table in Postgres with categories and values. I want to perform an aggregation, e.g. avg(value) per category but including the first row of the next category in the aggregation.

Sample table:

id  category  value
-------------------
1   1         5.4
2   1         2.1
3   2         1.0
4   2         2.6
5   2         0.3
6   3         4.4
7   3         3.8

id is a primary key and provides an order. Categories are grouped within the order and consecutive.
It would be acceptable (but not necessary) to create an intermediate table like this, which is duplicating the adjacent row:

id  category  value
-------------------
1   1         5.4
2   1         2.1
3   1         1.0  <-- new row
4   2         1.0
5   2         2.6
6   2         0.3
7   2         4.4  <-- new row
8   3         4.4
9   3         3.8

... and then do:

select category, avg(value) group by category from sample_table

How can this be achieved with SQL statements?

I suspect this could be done with window functions and some sophisticated frame clause like GROUPS, but I don't know how. ( See https://www.postgresql.org/docs/12/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS )

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Michael S
  • 738
  • 1
  • 8
  • 20

1 Answers1

2

You confirmed that category numbers are increasing by 1 steadily, no gaps.
Here is a simple approach for the simple case:

SELECT category, avg(value)
FROM  (
   SELECT category, value
   FROM   tbl

   UNION ALL
   (  -- parentheses required
   SELECT DISTINCT ON (category)
          category - 1, value
   FROM   tbl
   WHERE  category > (SELECT min(category) FROM tbl)  -- eliminate corner case
   ORDER  BY category, id
   )   
   ) sub
GROUP  BY 1
ORDER  BY 1;

The second term after UNION ALL adds the row like you suggested: I take the first row of each group and simply subtract 1 from the category.

Corner case: adds a new category with min(category) - 1. Can be eliminated easily ...


Generic solution for any kind of categories (as long as the order is defined):

SELECT category, avg(value)
FROM  (
   SELECT category, value
   FROM   tbl

   UNION ALL
   SELECT lag(category) OVER (ORDER BY category), value
   FROM  (
      SELECT DISTINCT ON (category)
             category, value
      FROM   tbl
      ORDER  BY category, id
      ) unicat
   ) sub
WHERE  category IS NOT NULL  -- eliminate corner case
GROUP  BY 1
ORDER  BY 1;

The first value of each group is added to to the previous category using the window function lag().

About DISTINCT ON:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    Hi Erwin, this is great stuff of combining DISTINCT, lag() and UNION. DISTINCT ON (category) selects the first row of a category group. And with lag(category) you are assigning this 'new' row to previous category. I like the second generic solution very much above the first because it is more generic. – Michael S Jan 07 '21 at 20:42
  • @Michael Yeah! So I called it "generic" :) The first one is faster though - if applicable. – Erwin Brandstetter Jan 07 '21 at 21:34