Python: select most frequent using group by

Question

How do I get the most frequently occurring category (mode) for each tag in Python?

+------------------+----------+
| tag              | category |
+------------------+----------+
| automotive       |        8 |
| ba               |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| bath             |        9 |
+-----------------------------+

expected output would be like

    tag     | category 
------------+-----------
 ba         |        8      
 automotive |        8      
 bananatree |        8        
 bath       |        9    
 bamboo     |        8

I have borrowed table from Stephen J. Fuhry and edited output from David Fuhry on MySQL SELECT most frequent by group due to confidentiality of my data set.

If you can push that processing to the SQL server, I'd do that rather than have it shove loads of data over to your program (regardless of what it is written in) that you need to sift through. Ergo, I'd suggest this is a MySQL question more than Python. — Nick T, May 16 '14 at 04:04
what kind of object are you starting with? If you're set on sql, sqlite3 is a native package that can handle database functions in a memory (using the ':memory:' table). — DrBailey, May 16 '14 at 04:15
@ Nick T I am doing data analysis on my data set using Pandas and sk learn. At the end of the day, I still have to have my machine loaded the entire data set in my memory, anyways. — bensw, May 16 '14 at 04:37
@ DrBailey I'm afraid that might not work out since I have CSV file to begin with and loaded in to a data frame by Pandas. — bensw, May 16 '14 at 04:41

Karl D. · Accepted Answer · 2014-05-16T13:43:32.647

In the comments you note you're using pandas. You can do something like the following:

>>> df

           tag  category
0    automotive         8
1            ba         8
2        bamboo         8
3        bamboo         8
4        bamboo         8
5        bamboo         8
6        bamboo         8
7        bamboo        10
8        bamboo         8
9        bamboo         9
10       bamboo         8
11       bamboo        10
12       bamboo         8
13       bamboo         9
14       bamboo         8
15  banana tree         8
16  banana tree         8
17  banana tree         8
18  banana tree         8
19         bath         9

Do a groupby on 'tag' for the 'category' column and then within each group use the mode method. However, we have to make it a conditional because pandas doesn't return a number for the mode if the number of observations is less than 3 (we can just return the group itself in the special cases of 1 or 2 observations in a group). We can use the aggregate/agg method with a lambda function to do this:

>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)

tag
automotive     8
ba             8
bamboo         8
banana tree    8
bath           9

Note, when the mode is multi-modal you will get a array (numpy). For example, suppose there were two entries for bath (all the other data is the same):

tag|category
bath|9
bath|10

In that case the output would be:

>>> mode = lambda x: x.mode() if len(x) > 2 else np.array(x)
>>> df.groupby('tag')['category'].agg(mode)

tag
automotive           8
ba                   8
bamboo               8
banana tree          8
bath           [9, 10]

You can also use the value_counts method instead of mode. Once again, do a groupby on 'tag' for the 'category' column and then within each group use the value_counts method. value_counts arranges in descending order so you want to grab the index of the first row:

>>> df.groupby('tag')['category'].agg(lambda x: x.value_counts().index[0])

tag
automotive     8
ba             8
bamboo         8
banana tree    8
bath           9

However, this won't return an array in multi-modal situations. It will just return the first mode.

Python: select most frequent using group by

1 Answers1