36

If I have a table like this:

pkey   age
----   ---
   1     8
   2     5
   3    12
   4    12
   5    22

I can "group by" to get a count of each age.

select age,count(*) n from tbl group by age;
age  n
---  -
  5  1
  8  1
 12  2
 22  1

What query can I use to group by age ranges?

  age  n
-----  -
 1-10  2
11-20  2
20+    1

I'm on 10gR2, but I'd be interested in any 11g-specific approaches as well.

zb226
  • 9,586
  • 6
  • 49
  • 79
Mark Harrison
  • 297,451
  • 125
  • 333
  • 465

10 Answers10

65
SELECT CASE 
         WHEN age <= 10 THEN '1-10' 
         WHEN age <= 20 THEN '11-20' 
         ELSE '21+' 
       END AS age, 
       COUNT(*) AS n
FROM age
GROUP BY CASE 
           WHEN age <= 10 THEN '1-10' 
           WHEN age <= 20 THEN '11-20' 
           ELSE '21+' 
         END
OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
Einstein
  • 4,450
  • 1
  • 23
  • 20
  • This should be the first and only answer to this question. Could use a little more formatting though. – jva Mar 20 '10 at 20:47
  • 2
    No, CASE statements use short circut evaluation – Einstein Mar 21 '10 at 05:06
  • How would short circut evaluation cause a problem in this query? Because the cases are ordered and use <= then the correct group is always picked. Isn't it? – Adrian Mar 22 '10 at 20:49
  • 1
    Adrian your correct, it was in reply to a previous comment that had since been removed. – Einstein Mar 22 '10 at 23:41
  • 1
    Is there a way to include a range with no rows. example: if there's no one above 20, the query return a row of (20+, 0)? – dcarneiro May 15 '12 at 17:12
29

Try:

select to_char(floor(age/10) * 10) || '-' 
|| to_char(ceil(age/10) * 10 - 1)) as age, 
count(*) as n from tbl group by floor(age/10);
Matthew Flaschen
  • 278,309
  • 50
  • 514
  • 539
  • 4
    clever usage of floor/division! – mpen Mar 20 '10 at 23:49
  • 1
    This approach is better when we have a defined pattern and groups can be calculated via an expression. It does not require to explicitly mention the groups in query and hence will be able to provide new groups without modifying the query .... – Nitin Midha Jun 18 '12 at 20:15
  • 2
    This does not work, it results in **error ORA-00979: not a GROUP BY expression** because `ceil(age/10)` is missing in the GROUP BY expression. But the direction of this approach is better as @NitinMidha wrote, so I am voting this answer up. – Wintermute Jul 17 '14 at 23:12
14

What you are looking for, is basically the data for a histogram.

You would have the age (or age-range) on the x-axis and the count n (or frequency) on the y-axis.

In the simplest form, one could simply count the number of each distinct age value like you already described:

SELECT age, count(*)
FROM tbl
GROUP BY age

When there are too many different values for the x-axis however, one may want to create groups (or clusters or buckets). In your case, you group by a constant range of 10.

We can avoid writing a WHEN ... THEN line for each range - there could be hundreds if it were not about age. Instead, the approach by @MatthewFlaschen is preferable for the reasons mentioned by @NitinMidha.

Now let's build the SQL...

First, we need to split the ages into range-groups of 10 like so:

  • 0-9
  • 10-19
  • 20 - 29
  • etc.

This can be achieved by dividing the age column by 10 and then calculating the result's FLOOR:

FLOOR(age/10)

"FLOOR returns the largest integer equal to or less than n" http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions067.htm#SQLRF00643

Then we take the original SQL and replace age with that expression:

SELECT FLOOR(age/10), count(*)
FROM tbl
GROUP BY FLOOR(age/10)

This is OK, but we cannot see the range, yet. Instead we only see the calculated floor values which are 0, 1, 2 ... n.

To get the actual lower bound, we need to multiply it with 10 again so we get 0, 10, 20 ... n:

FLOOR(age/10) * 10

We also need the upper bound of each range which is lower bound + 10 - 1 or

FLOOR(age/10) * 10 + 10 - 1

Finally, we concatenate both into a string like this:

TO_CHAR(FLOOR(age/10) * 10) || '-' || TO_CHAR(FLOOR(age/10) * 10 + 10 - 1)

This creates '0-9', '10-19', '20-29' etc.

Now our SQL looks like this:

SELECT 
TO_CHAR(FLOOR(age/10) * 10) || ' - ' || TO_CHAR(FLOOR(age/10) * 10 + 10 - 1),
COUNT(*)
FROM tbl
GROUP BY FLOOR(age/10)

Finally, apply an order and nice column aliases:

SELECT 
TO_CHAR(FLOOR(age/10) * 10) || ' - ' || TO_CHAR(FLOOR(age/10) * 10 + 10 - 1) AS range,
COUNT(*) AS frequency
FROM tbl
GROUP BY FLOOR(age/10)
ORDER BY FLOOR(age/10)

However, in more complex scenarios, these ranges might not be grouped into constant chunks of size 10, but need dynamical clustering. Oracle has more advanced histogram functions included, see http://docs.oracle.com/cd/E16655_01/server.121/e15858/tgsql_histo.htm#TGSQL366

Credits to @MatthewFlaschen for his approach; I only explained the details.

Wintermute
  • 394
  • 4
  • 19
3

Here is a solution which creates a "range" table in a sub-query and then uses this to partition the data from the main table:

SELECT DISTINCT descr
  , COUNT(*) OVER (PARTITION BY descr) n
FROM age_table INNER JOIN (
  select '1-10' descr, 1 rng_start, 10 rng_stop from dual
  union (
  select '11-20', 11, 20 from dual
  ) union (
  select '20+', 21, null from dual
)) ON age BETWEEN nvl(rng_start, age) AND nvl(rng_stop, age)
ORDER BY descr;
Dan
  • 548
  • 2
  • 7
2

I had to group data by how many transactions appeared in an hour. I did this by extracting the hour from the timestamp:

select extract(hour from transaction_time) as hour
      ,count(*)
from   table
where  transaction_date='01-jan-2000'
group by
       extract(hour from transaction_time)
order by
       extract(hour from transaction_time) asc
;

Giving output:

HOUR COUNT(*)
---- --------
   1     9199 
   2     9167 
   3     9997 
   4     7218

As you can see this gives a nice easy way of grouping the number of records per hour.

Clarkey
  • 1,553
  • 5
  • 22
  • 34
1

I had to get a count of samples by day. Inspired by @Clarkey I used TO_CHAR to extract the date of sample from the timestamp to an ISO-8601 date format and used that in the GROUP BY and ORDER BY clauses. (Further inspired, I also post it here in case it is useful to others.)

SELECT 
  TO_CHAR(X.TS_TIMESTAMP, 'YYYY-MM-DD') AS TS_DAY, 
  COUNT(*) 
FROM   
  TABLE X
GROUP BY
  TO_CHAR(X.TS_TIMESTAMP, 'YYYY-MM-DD')
ORDER BY
  TO_CHAR(X.TS_TIMESTAMP, 'YYYY-MM-DD') ASC
/
Kieron Hardy
  • 731
  • 7
  • 8
1

add an age_range table and an age_range_id field to your table and group by that instead.

// excuse the DDL but you should get the idea

create table age_range(
age_range_id tinyint unsigned not null primary key,
name varchar(255) not null);

insert into age_range values 
(1, '18-24'),(2, '25-34'),(3, '35-44'),(4, '45-54'),(5, '55-64');

// again excuse the DML but you should get the idea

select
 count(*) as counter, p.age_range_id, ar.name
from
  person p
inner join age_range ar on p.age_range_id = ar.age_range_id
group by
  p.age_range_id, ar.name order by counter desc;

You can refine this idea if you like - add from_age to_age columns in the age_range table etc - but i'll leave that to you.

hope this helps :)

Jon Black
  • 16,223
  • 5
  • 43
  • 42
  • Judging by the other responses performance and flexibility arent important criteria. The explain plans for all the dynamic queries listed would be horrendous and you'd have to amend code if your age ranges changed. Each to their own i guess :P – Jon Black Mar 21 '10 at 00:25
  • 1 full scan will allways be faster than 2 full scans. Also, people who ask for age range statistics probably have had the same ranges for last 20+ years and have no intentions to change that. – jva Mar 21 '10 at 21:54
  • 1
    I'm pretty sure the physical column will out perform a derived/calculated one. Infact it's probably an ideal candidate for a bitmap index. I'd still prefer to use a lookup table than to hardcode values into my applications. Adding a new age range say 14-16 yrs and i'm inserting a new row vs. raising a change request, spending time coding and testing the changes and releasing into prod. – Jon Black Mar 21 '10 at 23:08
1

If using Oracle 9i+, you might be able to use the NTILE analytic function:

WITH tiles AS (
  SELECT t.age,
         NTILE(3) OVER (ORDER BY t.age) AS tile
    FROM TABLE t)
  SELECT MIN(t.age) AS min_age,
         MAX(t.age) AS max_age,
         COUNT(t.tile) As n
    FROM tiles t
GROUP BY t.tile

The caveat to NTILE is that you can only specify the number of partitions, not the break points themselves. So you need to specify a number that is appropriate. IE: With 100 rows, NTILE(4) will allot 25 rows to each of the four buckets/partitions. You can not nest analytic functions, so you'd have to layer them using subqueries/subquery factoring to get desired granularity. Otherwise, use:

  SELECT CASE t.age
           WHEN BETWEEN 1 AND 10 THEN '1-10' 
           WHEN BETWEEN 11 AND 20 THEN '11-20' 
           ELSE '21+' 
         END AS age, 
         COUNT(*) AS n
    FROM TABLE t
GROUP BY CASE t.age
           WHEN BETWEEN 1 AND 10 THEN '1-10' 
           WHEN BETWEEN 11 AND 20 THEN '11-20' 
           ELSE '21+' 
         END
OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
1

Can you try the below solution:

SELECT count (1), '1-10'  where age between 1 and 10
union all 
SELECT count (1), '11-20'  where age between 11 and 20
union all
select count (1), '21+' where age >20
from age 
0

My approach:

select range, count(1) from (
select case 
  when age < 5 then '0-4' 
  when age < 10 then '5-9' 
  when age < 15 then '10-14' 
  when age < 20 then '15-20' 
  when age < 30 then '21-30' 
  when age < 40 then '31-40' 
  when age < 50 then '41-50' 
  else                '51+' 
end 
as range from
(select round(extract(day from feedback_update_time - feedback_time), 1) as age
from txn_history
) ) group by range  
  • I have flexibility in defining the ranges
  • I do not repeat the ranges in select and group clauses
  • but some one please tell me, how to order them by magnitude!
Ananth N
  • 61
  • 1
  • 2