How to get average of the 'middle' values in a group?

Question

I have a table that has values and group ids (simplified example). I need to get the average for each group of the middle 3 values. So, if there are 1, 2, or 3 values it's just the average. But if there are 4 values, it would exclude the highest, 5 values the highest and lowest, etc. I was thinking some sort of window function, but I'm not sure if it's possible.

http://www.sqlfiddle.com/#!11/af5e0/1

For this data:

TEST_ID TEST_VALUE  GROUP_ID
1       5           1
2       10          1
3       15          1
4       25          2
5       35          2
6       5           2
7       15          2
8       25          3
9       45          3
10      55          3
11      15          3
12      5           3
13      25          3
14      45          4

I'd like

GROUP_ID    AVG
1           10
2           15
3           21.6
4           45

Why do you want to do this? Just seems like an odd use requirment. As you scale up the number of rows into the millions, do you still want to average only 3 values? For even number of rows, do you always want to get rid of the highest value first, followed by the lowest? — Vulcronos, Sep 30 '13 at 18:22
@Vulcronos, there will be lots and lots of groups, but each group will contain between 0 and 10ish values, with the majority having 3-5 values. The values are results from testing, and the difference between the middle average and the overall average is of interest. — Barbara Laird, Sep 30 '13 at 18:32

score 6 · Accepted Answer · answered Sep 30 '13 at 18:39

6

Another option using analytic functions;

SELECT group_id,
       avg( test_value )
FROM (
  select t.*,
         row_number() over (partition by group_id order by test_value ) rn,
         count(*) over (partition by group_id  ) cnt
  from test t
) alias 
where 
   cnt <= 3
   or 
   rn between floor( cnt / 2 )-1 and ceil( cnt/ 2 ) +1
group by group_id
;

Demo --> http://www.sqlfiddle.com/#!11/af5e0/59

answered Sep 30 '13 at 18:39

krokodilko

35,300
7
55
79

This is a thing of beauty! – Barbara Laird Sep 30 '13 at 18:50
1

+1 for concise SQL, but -0.5 if I could for providing no explanation with your code. – IMSoP Sep 30 '13 at 18:55

score 2 · Answer 2 · answered Sep 30 '13 at 18:25

I'm not familiar with the Postgres syntax on windowed functions, but I was able to solve your problem in SQL Server with this SQL Fiddle. Maybe you'll be able to easily migrate this into Postgres-compatible code. Hope it helps!

A quick primer on how I worked it.

Order the test scores for each group
Get a count of items in each group
Use that as a subquery and select only the middle 3 items (that's the where clause in the outer query)
Get the average for each group

--

select  
  group_id,
  avg(test_value)
from (
  select 
    t.group_id, 
    convert(decimal,t.test_value) as test_value, 
    row_number() over (
      partition by t.group_id
      order by t.test_value
    ) as ord,
    g.gc
  from
    test t
    inner join (
      select group_id, count(*) as gc
      from test
      group by group_id
    ) g
      on t.group_id = g.group_id
  ) a
where
  ord >= case when gc <= 3 then 1 when gc % 2 = 1 then gc / 2 else (gc - 1) / 2 end
  and ord <= case when gc <= 3 then 3 when gc % 2 = 1 then (gc / 2) + 2 else ((gc - 1) / 2) + 2 end
group by
  group_id

I know - I'm not familiar with the exact postgres syntaxes/function names that I'm familiar with in SQL Server. If someone knows both and wants to translate, that'd be awesome. — Derek, Sep 30 '13 at 18:29
I opened your SQL Fiddle, selected PostgreSQL, and built and ran. The only thing that complained was the non-standard `CONVERT` function; I swapped that for a standard `CAST .. AS` and it worked fine: http://www.sqlfiddle.com/#!12/07bc9/2 — IMSoP, Sep 30 '13 at 18:30

Roman Pekar · Answer 3 · 2013-09-30T18:59:42.610

2

with cte as (
    select
        *,
        row_number() over(partition by group_id order by test_value) as rn,
        count(*) over(partition by group_id) as cnt
    from test
)
select
    group_id, avg(test_value)
from cte
where
    cnt <= 3 or
    (rn >= cnt / 2 - 1 and rn <= cnt / 2 + 1)
group by group_id
order by group_id

sql fiddle demo

in the cte, we need to get count of elements over each group_id by window function + calculate row_number inside each group_id. Then, if this count > 3 then we need to get middle of the group by dividing count by 2 and then get +1 and -1 element. If count <= 3, then we should just take all elements.

edited Sep 30 '13 at 18:59

answered Sep 30 '13 at 18:46

Roman Pekar

107,110
28
195
197

+1 for using a CTE, but -0.5 if I could for providing no explanation with your code. – IMSoP Sep 30 '13 at 18:55
@IMSoP actually, I really, really trying to write SQL which don't need comments :) , but added explanation anyway – Roman Pekar Sep 30 '13 at 19:00
Yeah, to be fair, it is pretty readable code :) Although you could have named your CTE better than "cte" :P – IMSoP Sep 30 '13 at 19:02
In quick testing with my actual data, the subquery is a little quicker than the CTE. But, a great solution! – Barbara Laird Sep 30 '13 at 19:03
@IMSoP yeah, i'm still in the middle of developing code convention for such a simple precalculation cte. I don't think that something like 'precalc' or 'aggr' would be better :) I'm open for suggestions BTW :) – Roman Pekar Sep 30 '13 at 19:04
@BarbaraLaird to be fair, solution with subquery was submitted before mine, so you did the right thing. Actually it's strange that cte is slower, may be it's because of ordering? – Roman Pekar Sep 30 '13 at 19:06
I think I'd go with something that gave a quick summary of what the CTE contains/adds; in this case, maybe `group_row_numbers` or `test_with_group_row_numbers`. You can alias it to a few letters in the actual `FROM`/`JOIN` clause like you would any table or view, but it makes it that little bit more self-documenting. – IMSoP Sep 30 '13 at 19:10
But then, I'm not a fan of super-short names like `rn` and `cnt` either - I'd probably have gone for `row_num` (or `group_num`?) and `row_count` (or `group_count`). – IMSoP Sep 30 '13 at 19:13
@IMSoP I actually want the name of cte to contain cte as well, so it'll be impossible to think that this is a real table. Have to think about a bit more, thanks for suggestion. I'm also should think about what happens when I have to add, for example a couple of aggregates, or switch row_number to dense_rank (at my work, SQL code could be changed frequently), and I don't want to rename cte every time – Roman Pekar Sep 30 '13 at 19:13
Hm, didn't think of the CTE changing use in place like that. The main places I've used CTEs, I've ended up chaining multiple CTEs together, and giving them "table-like" names makes the later parts a lot easier to follow. – IMSoP Sep 30 '13 at 19:16

score 1 · Answer 4 · answered Sep 30 '13 at 18:31

This works:

SELECT A.group_id, avg(A.test_value) AS avg_mid3 FROM
  (SELECT group_id,
         test_value,
         row_number() OVER (PARTITION BY group_id ORDER BY test_value) AS position
      FROM test) A
JOIN
  (SELECT group_id,
         CASE
           WHEN count(*) < 4 THEN 1
           WHEN count(*) % 2 = 0 THEN (count(*)/2 - 1)
           ELSE (count(*) / 2)
         END AS position_start,
         CASE
           WHEN count(*) < 4 THEN count(*)
           WHEN count(*) % 2 = 0 THEN (count(*)/2 + 1)
           ELSE (count(*) / 2 + 2)
         END AS position_end
         FROM test GROUP BY group_id) B
  ON A.group_id=B.group_id 
  AND A.position >= B.position_start 
  AND A.position <= B.position_end
GROUP BY A.group_id

Fiddle link: http://www.sqlfiddle.com/#!11/af5e0/56

Thanks! I couldn't figure out how to oder them from the middle out, but that case statement works. — Barbara Laird, Sep 30 '13 at 18:45

kapandron · Answer 5 · 2013-09-30T18:26:24.180

If you need to calculate the average values for groups then you can do this:

SELECT CASE WHEN NUMBER_FIRST_GROUP <> 0 
               THEN SUM_FIRST_GROUP / NUMBER_FIRST_GROUP 
               ELSE NULL
       END AS AVG_FIRST_GROUP,
       CASE WHEN NUMBER_SECOND_GROUP <> 0 
               THEN SUM_SECOND_GROUP / NUMBER_SECOND_GROUP 
               ELSE NULL
       END AS AVG_SECOND_GROUP,
       CASE WHEN NUMBER_THIRD_GROUP <> 0 
               THEN SUM_THIRD_GROUP / NUMBER_THIRD_GROUP 
               ELSE NULL
       END AS AVG_THIRD_GROUP,
       CASE WHEN NUMBER_FOURTH_GROUP <> 0 
               THEN SUM_FOURTH_GROUP / NUMBER_FOURTH_GROUP 
               ELSE NULL
       END AS AVG_FOURTH_GROUP
FROM (
      SELECT 
         SUM(CASE WHEN GROUP_ID = 1 THEN 1 ELSE 0 END) AS NUMBER_FIRST_GROUP,
         SUM(CASE WHEN GROUP_ID = 1 THEN TEST_VALUE ELSE 0 END) AS SUM_FIRST_GROUP,
         SUM(CASE WHEN GROUP_ID = 2 THEN 1 ELSE 0 END) AS NUMBER_SECOND_GROUP,
         SUM(CASE WHEN GROUP_ID = 2 THEN TEST_VALUE ELSE 0 END) AS SUM_SECOND_GROUP,
         SUM(CASE WHEN GROUP_ID = 3 THEN 1 ELSE 0 END) AS NUMBER_THIRD_GROUP,
         SUM(CASE WHEN GROUP_ID = 3 THEN TEST_VALUE ELSE 0 END) AS SUM_THIRD_GROUP,
         SUM(CASE WHEN GROUP_ID = 4 THEN 1 ELSE 0 END) AS NUMBER_FOURTH_GROUP,
         SUM(CASE WHEN GROUP_ID = 4 THEN TEST_VALUE ELSE 0 END) AS SUM_FOURTH_GROUP
     FROM TEST
     ) AS FOO

This won't work well for me, since I'll have lots and lots of groups. But, thank you. — Barbara Laird, Sep 30 '13 at 18:44

How to get average of the 'middle' values in a group?

5 Answers5

Linked