I'm trying to figure out a way to reason about a fairly simple problem: a table `foo` has two columns, `id` with unique identifiers, and `value` with a random `numeric` value. Given a constant `threshold` value, find the groups of rows whose `value`s are within that threshold of each other, and return each group's count and average value.
Let's set the stage - here is the `foo` table (in PostgreSQL):
CREATE TABLE foo (
id serial PRIMARY KEY,
value numeric NOT NULL
);
INSERT INTO foo(value) VALUES (1),(2),(2),(3),(6),(7),(7),(8);
Which gives the following data:
| id | value |
|----|-------|
| 1  | 1     |
| 2  | 2     |
| 3  | 2     |
| 4  | 3     |
| 5  | 6     |
| 6  | 7     |
| 7  | 7     |
| 8  | 8     |
The query should return counts and averages - entries cannot be double-counted, so I'd like the rows to be "cut into groups" based on some threshold "near-ness" value. You can think of the `threshold` value as a radius (for my example, a radius of `1.5` would be what I want):
SELECT AVG(foo.value), COUNT(foo.id)
FROM foo
GROUP BY foo.value; -- where I'm stuck
The query above will only apply the aggregate functions to rows in which `foo.value` has exact duplicates - the following is the result of this query:
| AVG(foo.value) | COUNT(foo.id) |
|----------------|---------------|
| 1              | 1             |
| 2              | 2             |
| 3              | 1             |
| 6              | 1             |
| 7              | 2             |
| 8              | 1             |
What I'd like is something like the following pseudo-SQL:
SELECT AVG(foo.value), COUNT(foo.id)
FROM foo
GROUP BY
(real(row_being_grouped.value) <= real(foo.value) + 1.5)
AND
(real(row_being_grouped.value) >= real(foo.value) - 1.5);
Where the result would look like the following:
| AVG(foo.value) | COUNT(foo.id) |
|----------------|---------------|
| 2              | 4             |
| 7              | 4             |
I'm not sure if this makes any sense. I'm wondering if I can get away with this without a sub-query - maybe by "bucketing" the rows somehow before finding the average within each bucket?
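For reference, here's the kind of naive fixed-width bucketing I had in mind - a sketch only, with the bucket bounds (`0` to `9`) and the bucket count (three) hand-picked for my example data rather than derived from the threshold:

```sql
-- Naive fixed-width bucketing: width_bucket() assigns each value to one of
-- three equal-width buckets spanning [0, 9). The bounds and bucket count
-- are hand-picked assumptions for this example, not derived from the
-- 1.5 threshold.
SELECT AVG(foo.value), COUNT(foo.id)
FROM foo
GROUP BY width_bucket(foo.value, 0, 9, 3)
ORDER BY AVG(foo.value);
```

This avoids a sub-query, but the fixed bucket edges don't follow the data: here `3` lands in a bucket by itself instead of joining `1, 2, 2`, so it doesn't produce the grouping I'm after.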
EDIT:
Before writing this question, I was unaware of window functions, and I now see how critical they are here - window functions could be used to process this data in a sub-query.
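For example, something like the following gaps-and-islands sketch (assuming the `1.5` threshold is hard-coded; note it groups rows wherever *consecutive* sorted values are within the threshold of each other, which matches my example but isn't exactly the "radius" semantics described above):

```sql
WITH flagged AS (
    SELECT id, value,
           -- flag a row as starting a new group when the gap to the
           -- previous value (in sorted order) exceeds the threshold
           CASE WHEN value - LAG(value) OVER (ORDER BY value) > 1.5
                THEN 1 ELSE 0
           END AS is_new_group
    FROM foo
), numbered AS (
    SELECT id, value,
           -- a running sum of the flags assigns each row a group number
           SUM(is_new_group) OVER (ORDER BY value, id) AS grp
    FROM flagged
)
SELECT AVG(value), COUNT(id)
FROM numbered
GROUP BY grp
ORDER BY grp;
```

On the example data this returns the two desired groups: an average of 2 with a count of 4, and an average of 7 with a count of 4.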