9

How do I get the most frequently occurring category for each tag in MySQL? Ideally, I would want to simulate an aggregate function that would calculate the mode of a column.

SELECT 
  t.tag 
  , s.category 
FROM tags t 
LEFT JOIN stuff s 
USING (id) 
ORDER BY tag;

+------------------+----------+
| tag              | category |
+------------------+----------+
| automotive       |        8 |
| ba               |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| bath             |        9 |
+-----------------------------+
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
Stephen Fuhry
  • 12,624
  • 6
  • 56
  • 55
  • Just thought I'd mention a couple years later and smarter - don't organize tags like this, it's an antipattern. Use a many2many table to define the relationship between tags and items. That said, I still wish there were a MODE aggregate function in MySQL. – Stephen Fuhry Feb 12 '12 at 17:31
  • There's no need to abbreviate `stuff` to `s` in the question. In this case it makes it harder to spot that `category` belongs to the `s` rather than `t` table as they are both single characters. – EoghanM Jun 18 '19 at 13:29

5 Answers5

4
SELECT t1.*
FROM (SELECT tag, category, COUNT(*) AS count
      FROM tags INNER JOIN stuff USING (id)
      GROUP BY tag, category) t1
LEFT OUTER JOIN 
     (SELECT tag, category, COUNT(*) AS count
      FROM tags INNER JOIN stuff USING (id)
      GROUP BY tag, category) t2
  ON (t1.tag = t2.tag AND (t1.count < t2.count 
      OR t1.count = t2.count AND t1.category < t2.category))
WHERE t2.tag IS NULL
ORDER BY t1.count DESC;

I agree this is kind of too much for a single SQL query. Any use of GROUP BY inside a subquery makes me wince. You can make it look simpler by using views:

CREATE VIEW count_per_category AS
    SELECT tag, category, COUNT(*) AS count
    FROM tags INNER JOIN stuff USING (id)
    GROUP BY tag, category;

SELECT t1.*
FROM count_per_category t1
LEFT OUTER JOIN count_per_category t2
  ON (t1.tag = t2.tag AND (t1.count < t2.count 
      OR t1.count = t2.count AND t1.category < t2.category))
WHERE t2.tag IS NULL
ORDER BY t1.count DESC;

But it's basically doing the same work behind the scenes.

You comment that you could do a similar operation easily in application code. So why don't you do that? Do the simpler query to get the counts per category:

SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category;

And sort through the result in application code.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • I've been having difficulty getting that to work.. It would seem to be better to make an aggregate function MOST_FREQUENT().. I'm going to see if that is within my skill level here... – Stephen Fuhry Sep 11 '09 at 15:09
  • Sorry, I misunderstood your schema. I've taken a closer look and mocked up a test database so I could be sure the query works. Try the edited version above. – Bill Karwin Sep 11 '09 at 15:48
  • That seems to work. It's a little tough to swallow though.. and there are two sub-selects instead of just one. I wish there were just a built in aggregate function MEAN() or something :-P. I could probably write that using C in like 5min. – Stephen Fuhry Sep 11 '09 at 16:03
  • See my additional content after the separator line. – Bill Karwin Sep 11 '09 at 16:46
4
SELECT  tag, category
FROM    (
        SELECT  @tag <> tag AS _new,
                @tag := tag AS tag,
                category, COUNT(*) AS cnt
        FROM    (
                SELECT  @tag := ''
                ) vars,
                stuff
        GROUP BY
                tag, category
        ORDER BY
                tag, cnt DESC
        ) q
WHERE   _new

On your data, this returns the following:

'automotive',  8
'ba',          8
'bamboo',      8
'bananatree',  8
'bath',        9

Here's the test script:

CREATE TABLE stuff (tag VARCHAR(20) NOT NULL, category INT NOT NULL);

INSERT
INTO    stuff
VALUES
('automotive',8),
('ba',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',10),
('bamboo',8),
('bamboo',9),
('bamboo',8),
('bamboo',10),
('bamboo',8),
('bamboo',9),
('bamboo',8),
('bananatree',8),
('bananatree',8),
('bananatree',8),
('bananatree',8),
('bath',9);
Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • This should be the correct answer. It avoids self joining a group by query, and should be a lot more efficient. it does a single group by and iterates it, marking the rows we need, and then filters them out. Smart! – Kjell Rilbe Dec 10 '18 at 10:03
3

(Edit: forgot DESC in ORDER BYs)

Easy to do with a LIMIT in the subquery. Does MySQL still have the no-LIMIT-in-subqueries restriction? Below example is using PostgreSQL.

=> select tag, (select category from stuff z where z.tag = s.tag group by tag, category order by count(*) DESC limit 1) AS category, (select count(*) from stuff z where z.tag = s.tag group by tag, category order by count(*) DESC limit 1) AS num_items from stuff s group by tag;
    tag     | category | num_items 
------------+----------+-----------
 ba         |        8 |         1
 automotive |        8 |         1
 bananatree |        8 |         4
 bath       |        9 |         1
 bamboo     |        8 |         9
(5 rows)

Third column is only necessary if you need the count.

1

This is for simpler situations:

SELECT action, COUNT(action) AS ActionCount FROM log GROUP BY action ORDER BY ActionCount DESC;

abhisek
  • 924
  • 1
  • 13
  • 27
0

Here's a hacky approach to this which utilizes the max aggregate function seeing as there is no mode aggregate function in MySQL (or windowing functions etc.) that would allow this:

SELECT  
  tag, 
  convert(substring(max(concat(lpad(c, 20, '0'), category)), 21), int) 
        AS most_frequent_category 
FROM (
    SELECT tag, category, count(*) AS c
    FROM tags INNER JOIN stuff using (id) 
    GROUP BY tag, category
) as grouped_cats 
GROUP BY tag;

Basically it utilizes the fact that we can find the lexical max of the counts of each individual category.

This is easier to see with named categories:

create temporary table tags (id int auto_increment primary key, tag character varying(20));
create temporary table stuff (id int, category character varying(20));
insert into tags (tag) values ('automotive'), ('ba'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('banana tree'), ('banana tree'), ('banana tree'), ('banana tree'), ('bath');
insert into stuff (id, category) values (1, 'cat-8'), (2, 'cat-8'), (3, 'cat-8'), (4, 'cat-8'), (5, 'cat-8'), (6, 'cat-8'), (7, 'cat-8'), (8, 'cat-10'), (9, 'cat-8'), (10, 'cat-9'), (11, 'cat-8'), (12, 'cat-10'), (13, 'cat-8'), (14, 'cat-9'), (15, 'cat-8'), (16, 'cat-8'), (17, 'cat-8'), (18, 'cat-8'), (19, 'cat-8'), (20, 'cat-9');

In which case we shouldn't be doing integer conversion on the most_frequent_category column:

SELECT 
  tag, 
  substring(max(concat(lpad(c, 20, '0'), category)), 21) AS most_frequent_category 
FROM (
    SELECT tag, category, count(*) AS c
    FROM tags INNER JOIN stuff using (id) 
    GROUP BY tag, category
) as grouped_cats 
GROUP BY tag;

+-------------+------------------------+
| tag         | most_frequent_category |
+-------------+------------------------+
| automotive  | cat-8                  |
| ba          | cat-8                  |
| bamboo      | cat-8                  |
| banana tree | cat-8                  |
| bath        | cat-9                  |
+-------------+------------------------+

And to delve a little bit more into what is going on, here's what the grouped_cats inner select looks like (I've added order by tag, c desc):

+-------------+----------+---+
| tag         | category | c |
+-------------+----------+---+
| automotive  | cat-8    | 1 |
| ba          | cat-8    | 1 |
| bamboo      | cat-8    | 9 |
| bamboo      | cat-10   | 2 |
| bamboo      | cat-9    | 2 |
| banana tree | cat-8    | 4 |
| bath        | cat-9    | 1 |
+-------------+----------+---+

And we can see how the max of the count(*) column drags along it's associated category if we omit the substring bit:

SELECT 
  tag, 
  max(concat(lpad(c, 20, '0'), category)) AS xmost_frequent_category
FROM (
    SELECT tag, category, count(*) AS c
    FROM tags INNER JOIN stuff using (id) 
    GROUP BY tag, category
) as grouped_cats 
GROUP BY tag;

+-------------+---------------------------+
| tag         | xmost_frequent_category   |
+-------------+---------------------------+
| automotive  | 00000000000000000001cat-8 |
| ba          | 00000000000000000001cat-8 |
| bamboo      | 00000000000000000009cat-8 |
| banana tree | 00000000000000000004cat-8 |
| bath        | 00000000000000000001cat-9 |
+-------------+---------------------------+
EoghanM
  • 25,161
  • 23
  • 90
  • 123