12

EDIT: I'm using MySQL, I found another post with the same question, but it's in Postgres; I require MySQL.

Get most common value for each value of another column in SQL

I ask this question after extensive searching of this site and others but have not found a result that works as I intend it to.

I have a table of people (recordid, personid, transactionid) and a transaction table (transactionid, rating). I require a single SQL statement that can return the most common rating each person has.

I currently have this SQL statement that returns the most common rating for a specified person id. It works and perhaps it may help others.

SELECT transactionTable.rating as MostCommonRating 
FROM personTable, transactionTable 
WHERE personTable.transactionid = transactionTable.transactionid 
AND personTable.personid = 1
GROUP BY transactionTable.rating 
ORDER BY COUNT(transactionTable.rating) desc 
LIMIT 1

However I require a statement that does what the above statement does for each personid in personTable.

My attempt is below; however, it times out my MySQL server.

SELECT personid AS pid, 
(SELECT transactionTable.rating as MostCommonRating 
FROM personTable, transactionTable 
WHERE personTable.transactionid = transactionTable.transactionid 
AND personTable.personid = pid
GROUP BY transactionTable.rating 
ORDER BY COUNT(transactionTable.rating) desc 
LIMIT 1)
FROM persontable
GROUP BY personid

Any help you can give me would be much obliged. Thanks.

PERSONTABLE:

RecordID,   PersonID,   TransactionID
1,      Adam,       1
2,      Adam,       2
3,      Adam,       3
4,      Ben,        1
5,      Ben,        3
6,      Ben,        4
7,      Caitlin,    4
8,      Caitlin,    5
9,      Caitlin,    1

TRANSACTIONTABLE:

TransactionID,  Rating
1       Good
2       Bad
3       Good
4       Average
5       Average

The output of the SQL statement I am searching for would be:

OUTPUT:

PersonID,   MostCommonRating
Adam        Good
Ben         Good
Caitlin     Average
Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Rusty Robot
  • 1,725
  • 2
  • 13
  • 29
  • Please tag with the database engine you're using. – Mat Sep 16 '12 at 11:15
  • I'm about to head out for a few hours so can only check this again in like 5 hours,but, a few things pop to mind. Have you tried putting an index on any of these columns? That is likely the reason your SQL server is timing out. http://msdn.microsoft.com/en-us/library/ms190806.aspx – user1590499 Sep 16 '12 at 11:15
  • I think you have a typo in the first row in the `TRANSACTIONTABLE` `1 GOOD` shouldn't be `1 Very bad` because you have `3 GOOD`?? – Mahmoud Gamal Sep 16 '12 at 11:15
  • The typo is not the cause, these examples are simply to simulate the problem I am having, they are not the actual data I am working with. – Rusty Robot Sep 16 '12 at 13:11
  • This post is the same as mine but the answer is in POSTGRES I require MYSQL. http://stackoverflow.com/questions/344665/get-most-common-value-in-sql – Rusty Robot Sep 16 '12 at 13:11
  • Possible duplicate of [MySQL SELECT most frequent by group](http://stackoverflow.com/questions/1407723/mysql-select-most-frequent-by-group) – Ciro Santilli OurBigBook.com Nov 21 '15 at 10:38

3 Answers3

29

Preliminary comment

Please learn to use the explicit JOIN notation, not the old (pre-1992) implicit join notation.

Old style:

SELECT transactionTable.rating as MostCommonRating 
FROM personTable, transactionTable 
WHERE personTable.transactionid = transactionTable.transactionid 
AND personTable.personid = 1
GROUP BY transactionTable.rating 
ORDER BY COUNT(transactionTable.rating) desc 
LIMIT 1

Preferred style:

SELECT transactionTable.rating AS MostCommonRating 
  FROM personTable
  JOIN transactionTable 
    ON personTable.transactionid = transactionTable.transactionid 
 WHERE personTable.personid = 1
 GROUP BY transactionTable.rating 
 ORDER BY COUNT(transactionTable.rating) desc 
 LIMIT 1

You need an ON condition for each JOIN.

Also, the personID values in the data are strings, not numbers, so you'd need to write

 WHERE personTable.personid = "Ben"

for example, to get the query to work on the tables shown.


Main answer

You're seeking to find an aggregate of an aggregate: in this case, the maximum of a count. So, any general solution is going to involve both MAX and COUNT. You can't apply MAX directly to COUNT, but you can apply MAX to a column from a sub-query where the column happens to be a COUNT.

Build the query up using Test-Driven Query Design — TDQD.

Select person and transaction rating

SELECT p.PersonID, t.Rating, t.TransactionID
  FROM PersonTable AS p
  JOIN TransactionTable AS t
    ON p.TransactionID = t.TransactionID

Select person, rating, and number of occurrences of rating

SELECT p.PersonID, t.Rating, COUNT(*) AS RatingCount
  FROM PersonTable AS p
  JOIN TransactionTable AS t
    ON p.TransactionID = t.TransactionID
 GROUP BY p.PersonID, t.Rating

This result will become a sub-query.

Find the maximum number of times the person gets any rating

SELECT s.PersonID, MAX(s.RatingCount)
  FROM (SELECT p.PersonID, t.Rating, COUNT(*) AS RatingCount
          FROM PersonTable AS p
          JOIN TransactionTable AS t
            ON p.TransactionID = t.TransactionID
         GROUP BY p.PersonID, t.Rating
       ) AS s
 GROUP BY s.PersonID

Now we know which is the maximum count for each person.

Required result

To get the result, we need to select the rows from the sub-query which have the maximum count. Note that if someone has 2 Good and 2 Bad ratings (and 2 is the maximum number of ratings of the same type for that person), then two records will be shown for that person.

SELECT s.PersonID, s.Rating
  FROM (SELECT p.PersonID, t.Rating, COUNT(*) AS RatingCount
          FROM PersonTable AS p
          JOIN TransactionTable AS t
            ON p.TransactionID = t.TransactionID
         GROUP BY p.PersonID, t.Rating
       ) AS s
  JOIN (SELECT s.PersonID, MAX(s.RatingCount) AS MaxRatingCount
          FROM (SELECT p.PersonID, t.Rating, COUNT(*) AS RatingCount
                  FROM PersonTable AS p
                  JOIN TransactionTable AS t
                    ON p.TransactionID = t.TransactionID
                 GROUP BY p.PersonID, t.Rating
               ) AS s
         GROUP BY s.PersonID
       ) AS m
    ON s.PersonID = m.PersonID AND s.RatingCount = m.MaxRatingCount

If you want the actual rating count too, that's easily selected.

That's a fairly complex piece of SQL. I would hate to try writing that from scratch. Indeed, I probably wouldn't bother; I'd develop it step-by-step, more or less as shown. But because we've debugged the sub-queries before we use them in bigger expressions, we can be confident of the answer.

WITH clause

Note that Standard SQL provides a WITH clause that prefixes a SELECT statement, naming a sub-query. (It can also be used for recursive queries, but we aren't needing that here.)

WITH RatingList AS
     (SELECT p.PersonID, t.Rating, COUNT(*) AS RatingCount
        FROM PersonTable AS p
        JOIN TransactionTable AS t
          ON p.TransactionID = t.TransactionID
       GROUP BY p.PersonID, t.Rating
     )
SELECT s.PersonID, s.Rating
  FROM RatingList AS s
  JOIN (SELECT s.PersonID, MAX(s.RatingCount) AS MaxRatingCount
          FROM RatingList AS s
         GROUP BY s.PersonID
       ) AS m
    ON s.PersonID = m.PersonID AND s.RatingCount = m.MaxRatingCount

This is simpler to write. Unfortunately, MySQL does not yet support the WITH clause.


The SQL above has now been tested against IBM Informix Dynamic Server 11.70.FC2 running on Mac OS X 10.7.4. That test exposed the problem diagnosed in the preliminary comment. The SQL for the main answer worked correctly without needing to be changed.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • 1
    +1, Great Answer, I have been trying to do it for more than have an hour. Here is a [Live demo](http://data.stackexchange.com/stackoverflow/query/80151/returning-the-aggregate-of-an-aggregate-the-maximum-of-a-count). But is there any common name for such a problem? Is it the aggregate of an aggregate? or what. Thanks – Mahmoud Gamal Sep 16 '12 at 15:35
  • Perfect! Thanks so much for this it help heaps. It was easy to understand because of the way you set it out and developed the query over time. A+ – Rusty Robot Sep 17 '12 at 10:54
  • As a substitute for the WITH clause, you can use CREATE TEMPORARY TABLE RatingList AS _select-statement_. – Barmar Sep 27 '12 at 20:26
  • Sigh, that doesn't work - MySQL doesn't let you refer to a temporary table multiple times in a query. You need to create a real table. – Barmar Sep 27 '12 at 20:33
  • 1
    @Barmar: Bad luck. Those sorts of restrictions — non-obvious and not obviously necessary — are irksome in a DBMS. But not being able to use a temp table name multiple times in a single query does seem like a peculiar restriction. – Jonathan Leffler Sep 27 '12 at 21:17
3

Here's a somewhat hacky abuse of the fact that the max aggregate function in MySQL does lexical sorting on varchars (as well as the expected numerical sorting on integers/floats):

SELECT 
  PersonID, 
  substring(max(concat(lpad(c, 20, '0'), Rating)), 21) AS MostFrequentRating 
FROM (
    SELECT PersonID, Rating, count(*) c 
    FROM PERSONTABLE INNER JOIN TRANSACTIONTABLE USING(TransactionID) 
    GROUP BY PersonID, Rating
) AS grouped_ratings 
GROUP BY PersonID;

Which gives the desired:

+----------+--------------------+
| PersonID | MostFrequentRating |
+----------+--------------------+
| Adam     | Good               |
| Ben      | Good               |
| Caitlin  | Average            |
+----------+--------------------+

(note, if there are multiple modes per person, it will pick the one with the highest alphabetic entry, so — pretty much randomly — Good over Bad and Bad over Average)

You should be able to see what the max is operating over by examining the following:

SELECT PersonID, Rating, count(*) c, concat(lpad(count(*), 20, '0'), Rating) as LexicalMaxMe 
FROM PERSONTABLE INNER JOIN TRANSACTIONTABLE USING(TransactionID) 
GROUP BY PersonID, Rating
ORDER BY PersonID, c DESC;

Which outputs:

+----------+---------+---+-----------------------------+
| PersonID | Rating  | c | LexicalMaxMe                |
+----------+---------+---+-----------------------------+
| Adam     | Good    | 2 | 00000000000000000002Good    |
| Adam     | Bad     | 1 | 00000000000000000001Bad     |
| Ben      | Good    | 2 | 00000000000000000002Good    |
| Ben      | Average | 1 | 00000000000000000001Average |
| Caitlin  | Average | 2 | 00000000000000000002Average |
| Caitlin  | Good    | 1 | 00000000000000000001Good    |    
+----------+---------+---+-----------------------------+
EoghanM
  • 25,161
  • 23
  • 90
  • 123
0

For anyone using Microsoft SQL Server: You have the possibility to create a custom aggregate function to get the most common value. Example 2 of this blog post by Ahmed Tarek Hasan describes how to do it:

http://developmentsimplyput.blogspot.nl/2013/03/creating-sql-custom-user-defined.html

Jorr.it
  • 1,222
  • 1
  • 14
  • 25