
I'm trying to figure out a way to reason about a fairly simple problem:

A table foo has two columns: id, a unique identifier, and value, a random numeric value. Given a constant threshold, find the groups of rows whose values are within that threshold of each other, and return each group's count and average value.

Let's set the stage - here is the foo table (in PostgreSQL):

CREATE TABLE foo (
    id serial PRIMARY KEY,
    value numeric NOT NULL
);
INSERT INTO foo(value) VALUES (1),(2),(2),(3),(6),(7),(7),(8);

Which gives the following data:

id value
1 1
2 2
3 2
4 3
5 6
6 7
7 7
8 8

The query should return counts and averages, and entries cannot be double-counted, so I'd like the rows to be "cut into groups" based on some threshold of "near-ness". You can think of the threshold as a radius (for my example, a radius of 1.5 gives the grouping I want):

SELECT AVG(foo.value), COUNT(foo.id)
  FROM foo
  GROUP BY foo.value; -- where I'm stuck

The query above will only apply the aggregate functions to rows in which foo.value has exact duplicates - the following is the result of this query:

AVG(foo.value) COUNT(foo.id)
1 1
2 2
3 1
6 1
7 2
8 1

What I'd like is something like the following pseudo-SQL:

SELECT AVG(foo.value), COUNT(foo.id)
  FROM foo
  GROUP BY
   (real(row_being_grouped.value) <= real(foo.value) + 1.5)
   AND
   (real(row_being_grouped.value) >= real(foo.value) - 1.5);

Where the result would look like the following:

AVG(foo.value) COUNT(foo.id)
2 4
7 4

I'm not sure if this makes any sense. I am wondering if I can get away with this without doing a sub-query - maybe "bucket" the rows somehow before finding the average within the buckets?
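To pin down what I mean by "bucket", here is a rough sketch in Python (not SQL; the function name and the 1.5 threshold are just for illustration): sort the values and start a new bucket whenever the gap to the previous value exceeds the threshold.

```python
# Hypothetical sketch of the grouping I have in mind: sort the values
# and open a new bucket whenever the gap to the previous value
# exceeds the threshold.
def bucket(values, threshold):
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= threshold:
            groups[-1].append(v)   # close enough: join the current bucket
        else:
            groups.append([v])     # gap too large: start a new bucket
    return groups

values = [1, 2, 2, 3, 6, 7, 7, 8]
for g in bucket(values, 1.5):
    print(len(g), sum(g) / len(g))
# prints:
# 4 2.0
# 4 7.0
```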

EDIT:

Before writing this question, I was unaware of window functions, and see how critical they are to this question - window functions could be used to process this data in a sub-query.

Athan Clark
  • Clustering is a hard problem that doesn't have a single, well-defined solution. So you cannot expect SQL to do that. Perhaps there can be a solution if you specify how *exactly* the groups should be formed. – Laurenz Albe Aug 03 '23 at 15:44
  • That's what I thought. I could imagine the clustering operation being done from a right or left fold, from a functional programming standpoint, but I'm not sure how I could do it with SQL. Do you have any recommendation for technologies that support some kind of stats analysis like this? – Athan Clark Aug 03 '23 at 16:07
  • group by is for aggregation with a fixed set of rules, for example avg over 10 years, but you want to group by a dynamic range; this will be impossible, so edit your question and show some data and the wanted result – nbk Aug 03 '23 at 16:41
  • Please add a few rows of sample data and the expected result. – The Impaler Aug 03 '23 at 16:44
  • Please clarify how clusters are defined. For example, if the threshold is 1 and there are four values, 1, 2, 3, and 4, then what are the clusters? Are the centers of the clusters predefined, or are they selected based on additional criteria; e.g., minimal number of clusters, most evenly distributed centers, etc.? You stated that rows may not be double counted. By that criteria, the pseudo SQL you presented fails when the value are spaced threshold units apart. What might seem a simple problem initially is actually much more complex. – JohnH Aug 04 '23 at 02:47

2 Answers


If "threshold" can be understood as the maximum gap allowed between values of the same group, then it's a well defined problem.
Here is a solution:

SELECT grp, count(*) AS grp_count, round(avg(value), 2) AS grp_avg
FROM  (
   SELECT count(gap) OVER (ORDER BY value) AS grp, *
   FROM  (
      SELECT value
           , value - lag(value) OVER (ORDER BY value) > 1.5 OR null AS gap
      FROM   foo
      ) sub1
   ) sub2
GROUP  BY grp
ORDER  BY grp;

fiddle (with step-by-step demo)
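As a quick sanity check, the same query (with only cosmetic changes) also runs on SQLite 3.25+, which supports these window functions, so it can be verified from Python's standard library:

```python
# Verify the gap-based grouping with SQLite's window functions
# (available since SQLite 3.25); the logic matches the PostgreSQL query.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE foo (id INTEGER PRIMARY KEY, value NUMERIC NOT NULL);
INSERT INTO foo(value) VALUES (1),(2),(2),(3),(6),(7),(7),(8);
""")
rows = con.execute("""
SELECT grp, count(*) AS grp_count, round(avg(value), 2) AS grp_avg
FROM (
   SELECT count(gap) OVER (ORDER BY value) AS grp, value
   FROM (
      SELECT value
           , value - lag(value) OVER (ORDER BY value) > 1.5 OR NULL AS gap
      FROM foo
      ) sub1
   ) sub2
GROUP BY grp
ORDER BY grp;
""").fetchall()
print(rows)   # [(0, 4, 2.0), (1, 4, 7.0)]
```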

Explanation and links to more:

For convenience and short code I use Boolean logic (`true OR null` yields `true`, `false OR null` yields `null`) and the fact that `count()` ignores null values.

If this is more of a raster / cluster / granulation / grid problem, you need to define exactly which rows to pick as "focal points" or, alternatively, the exact nature of the independent raster / grid.

Erwin Brandstetter
  • Although my solution wasn't exactly the same as this one, you are correct in that I misunderstood the problem - I needed to use window functions (which I wasn't aware of before posting this question) to generate a column that _could_ have an exact value to `GROUP` against (i.e. a `TRUE` / `NULL` value as you have). Thank you for your help! – Athan Clark Aug 09 '23 at 11:19

I suppose that the threshold value is 123.45... Then:

WITH 
threshold AS 
(SELECT 123.45 AS threshold_value, MIN(value) AS MI, MAX(value) AS MA 
 FROM foo),
slices AS
(SELECT threshold_value, boundary  
 FROM   threshold
        CROSS JOIN LATERAL generate_series(MI - threshold_value / 2.0, MA + threshold_value / 2.0, threshold_value) AS boundary),
places AS 
(SELECT id, value, boundary, threshold_value
 FROM   foo
        JOIN slices ON value >= boundary AND value < boundary + threshold_value)
SELECT  COUNT(value) AS COUNT_VAL, AVG(value) AS AVG_VAL, boundary AS BOUND_LOW, boundary + threshold_value AS BOUND_HIGH
FROM    places
GROUP   BY boundary, boundary + threshold_value
ORDER   BY BOUND_LOW;

Not tested...
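A rough Python sketch of this fixed-width slicing (hypothetical, using the question's data and a width of 1.5 rather than 123.45) shows how it differs from gap-based grouping:

```python
# Hypothetical sketch of the fixed-grid slicing above: each value lands
# in the slice [boundary, boundary + width), with the grid anchored at
# min(values) - width / 2. Note that a fixed grid can split the clusters
# the question wants (here 6 lands in a different slice than 7).
def fixed_buckets(values, width):
    lo = min(values) - width / 2.0
    buckets = {}
    for v in values:
        b = lo + width * int((v - lo) // width)   # lower bound of v's slice
        buckets.setdefault(b, []).append(v)
    return dict(sorted(buckets.items()))

for bound, vals in fixed_buckets([1, 2, 2, 3, 6, 7, 7, 8], 1.5).items():
    print(len(vals), bound, bound + 1.5)
```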

SQLpro
  • This query divides the threshold by 2, which is most likely not the OP's intent since threshold is analogous to distance, not span or range. – JohnH Aug 07 '23 at 13:28