1

I have a MySQL table where there are many rows for each person, and I want to write a query which aggregates rows with special constraint. (one per person)

For example, lets say the table is consist of following data.

name   date                    reason
---------------------------------------
John   2013-04-01 14:00:00     Vacation
John   2013-03-31 18:00:00     Sick
Ted    2012-05-06 20:00:00     Sick
Ted    2012-02-20 01:00:00     Vacation
John   2011-12-21 00:00:00     Sick
Bob    2011-04-02 20:00:00     Sick

I want to see the distribution of 'reason' column. If I just write a query like below

select reason, count(*) as count from table group by reason

then I will be able to see number of reasons for this table overall.

reason       count
------------------
Sick         4
Vacation     2

However, I am only interested in single reason from each person. The reason that should be counted should be from a row with latest date from the person's records. For example, John's latest reason would be Vacation while Ted's latest reason would be Sick. And Bob's latest reason (and the only reason) is Sick.

The expected result for that query should be like below. (Sum of count will be 3 because there are only 3 people)

reason      count
-----------------
Sick        2
Vacation    1

Is it possible to write a query such that single latest reason will be counted when I want to see distribution(count) of reasons?

Here are some facts about the table.

  • The table has tens of millions of rows
  • For most of times, each person has one reason.
  • Some people have multiple reasons, but 99.99% of people have fewer than 5 reasons.
  • There are about 30 different reasons while there are millions of distinct names.
  • The table is partitioned based on date range.
user482594
  • 16,878
  • 21
  • 72
  • 108
  • I see different SQL solutions proposed by people. Thank you for the responses. But I am not sure which one would be ideal to run since the table has millions of rows. Will using the 'explain' query be able to tell me the best query? Or does anyone know which one of answers is the best one by just looking at it? – user482594 Apr 03 '13 at 09:10
  • The query are all very similars and will give you the same amount of time. Try it, the explain plan will never tell you if it is fast enough! – Cyril Gandon Apr 03 '13 at 09:22

5 Answers5

1
SELECT T.REASON, COUNT(*) 
FROM
(
 SELECT PERSON, MAX(DATE) AS MAX_DATE
 FROM TABLE-NAME
 GROUP BY PERSON
) A, TABLE-NAME T
WHERE T.PERSON = A.PERSON AND T.DATE = A.MAX_DATE
GROUP BY T.REASON
Santhosh
  • 1,771
  • 1
  • 15
  • 25
  • It should be `A.MAX_DATE` in the `WHERE` clause. and its giving count as 2 for both vacation and sick – Ajo Koshy Apr 03 '13 at 08:26
  • Your query gives the value as `Sick 1 and Vacation 2`. The required is `Sick 2 Vacation 1` – Ajo Koshy Apr 03 '13 at 08:58
  • @AjoKoshy Done with where clause. I don't think u will get two for vactaion as one only John is having vacation, John 2013-04-01 14:00:00 Vacation – Santhosh Apr 03 '13 at 08:59
0

Try this

select reason, count(*) from 
(select reason from table where date in 
    (select max(date) from table group by name)) t 
group by reason
Mohan Raj B
  • 1,015
  • 7
  • 14
0

The solution you are looking for seems to be solved by this query :

select 
    reason, 
    count(*) 
from (select * from tablename group by name) abc 
group by 
    reason

It is quite fast and simple. You can view the SQL Fiddle

veljasije
  • 6,722
  • 12
  • 48
  • 79
Ajo Koshy
  • 1,205
  • 2
  • 21
  • 33
  • What is the default behavior when columns that are not part of `group by` clause have different values? For instance, you are using `group by` for name only while person has different dates and different reasons. Do you know if the behavior is consistent? – user482594 Apr 03 '13 at 09:17
  • @user482594 the default behavior here just helps you filter the number of repetitions.The `group by` in name helps you to get the latest reason per user. And then you can filter the results again as per requirement. The behavior is consistent in all ways – Ajo Koshy Apr 03 '13 at 09:21
0

In MySQL, it's not very efficient to do this kind of query since you don't have access to tools like partitionning query in SQL Server or Oracle.
You can still emulate it by doing a subquery and retrieve the rows based on the condition you need, here the maximum date :

SELECT t.reason, COUNT(1) 
FROM
(
     SELECT name, MAX(adate) AS maxDate
     FROM @aTable
     GROUP BY name
) maxDateRows
    INNER JOIN @aTable t ON maxDateRows.name = t.name
                        AND maxDateRows.maxDate = t.adate
GROUP BY t.reason

You can see a sample here.
Test this query on your samples, but I'm afraid that it will be slow as hell.

For your information, you can do the same thing in a more elegant and much much faster way in SQL Server :

SELECT reason, COUNT(1)
FROM
(
     SELECT name
          , reason
          , RANK() OVER(PARTITION BY name ORDER BY adate DESC) as Rank
     FROM @aTable
     ) AS rankTable
WHERE Rank = 1
GROUP BY reason

The sample is here

If you are really stuck to MySql, and the first query is too slow, then you can split the problem.

Do a first query creating a table:

CREATE TABLE maxDateRows AS
SELECT name, MAX(adate) AS maxDate
FROM @aTable
GROUP BY name

Then create index on both name and maxDate.
Finally, get the results :

SELECT t.reason, COUNT(1) 
FROM maxDateRows m
    INNER JOIN @aTable t ON m.name = t.name
                        AND m.maxDate = t.adate
GROUP BY t.reason
Community
  • 1
  • 1
Cyril Gandon
  • 16,830
  • 14
  • 78
  • 122
  • Yes.. I am stuck in MySQL for now. For your solution with `create table...` and `select`, can these 2 queries merged into one? I do not think I have seen that before.. so I am just asking to you about it. – user482594 Apr 03 '13 at 09:19
0

Apologies if this answer duplicates an existing. Maybe I'm suffering from some form aphasia but I cannot see it...

SELECT x.reason
     , COUNT(*) 
  FROM absentism x 
  JOIN 
     ( SELECT name,MAX(date) max_date FROM absentism GROUP BY name) y
    ON y.name = x.name 
   AND y.max_date = x.date 
 GROUP 
    BY reason;
Strawberry
  • 33,750
  • 13
  • 40
  • 57