1

I have a dataframe of several accounts that display different modes of animal categories. How can I identify the accounts that have more than 1 mode?

For example, note that account 3 only has one mode (i.e. "dog"), but accounts 1, 2 and 4 have multiple modes (i.e more than one mode).

test = pd.DataFrame({'account':[1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
             'category':['cat','dog','rabbit','cat','cat','dog','dog','dog','dog','dog','rabbit','rabbit','cat','cat','rabbit']})

The expected output I'm looking for would be something like this:

pd.DataFrame({'account':[1,2,4],'modes':[3,2,2]})

Secondary to this, I am then trying to take any random highest mode for all accounts having multiple modes. I have come up with the following code, however, this only returns the first (alphabetical) mode for each account. My intuition tells me something could be written within the iloc brackets below, perhaps a a random array between 0 and the total number of modes, but I'm unable to fully get there.

test.groupby('account')['category'].agg(lambda x: x.mode(dropna=False).iloc[0])

Any suggestions? Thanks much.

Gabe Verzino
  • 346
  • 1
  • 10
  • 1
    what is your expected output from the given dataframe? – Anurag Dabas Jul 26 '21 at 17:16
  • Updated in question, thanks. – Gabe Verzino Jul 26 '21 at 17:30
  • Is selecting a random mode or really the secondary problem? – ALollz Jul 26 '21 at 17:59
  • You deleted your answer! I thought it was a good one. Ultimately, I'll need a random mode because I'm dealing with ICD-10 codes for which multiple exist per visit, often of the same type (thus generating modes). Though its not very common (12% frequency), I'd like to select any random mode, and not the first alphabetical mode, so that my classification model isn't biasing towards the beginning of the alphabet :p That's the very long version of my question. Thanks for your help! – Gabe Verzino Jul 26 '21 at 18:05
  • 1
    @GabeVerzino I undeleted, but had deleted because from the wording it seemed like you wanted to focus more on determining which ones were duplicated, which my answer doesn't really address, but it does succeed in returning a random mode, so I'll add it back since it's useful – ALollz Jul 26 '21 at 18:12

3 Answers3

1

You can use numpy.random.choice for that

test.groupby('account')['category'].agg(
    lambda x: np.random.choice(x.mode(dropna=False)))
Chris Wesseling
  • 6,226
  • 2
  • 36
  • 72
1

Not sure that what you want but you can trY:

out=test.groupby('account')['category'].apply(lambda x: x.mode(dropna=False).values)

output of out:

account
1    [cat, dog, rabbit]
2            [cat, dog]
3                 [dog]
4         [cat, rabbit]
Name: category, dtype: object

For random mode value:

from random import choice

out=test.groupby('account')['category'].agg(
    lambda x: choice(x.mode(dropna=False)))

output of out(every time you run the code you got different output):

account
1    rabbit
2       dog
3       dog
4    rabbit
Name: category, dtype: object

For your expected output use:

out=test.groupby('account')['category'].apply(lambda x: x.mode(dropna=False).count()).reset_index()
out=out[out['category'].ne(1)]

output of out:

account     category
0   1       3
1   2       2
3   4       2
Anurag Dabas
  • 23,866
  • 9
  • 21
  • 41
1

Because you just want any random mode in the case of ties, you can use groupby + size. (Essentially a very similar wrapper to @abw333's solution). This is nice because it avoids any groupby.apply in favor of the built-in groupby.size, which is fast.

We use sort=False in the groupby, so the resulting Series is ordered by the order in which the groups appear in the original DataFrame. Then because the sorting algorithm, 'mergesort', is stable in the case of ties this will deterministically return the mode that appears first (earlier row) in the DataFrame in the case of ties. So if you want to get a random mode you can .sample(frac=1) before applying this so that it shuffles all the rows and then returns the mode.

def fast_mode(df, gp_cols, value_col):
    """ 
    Calculate the mode of a column, ignoring null values recognized by pandas. 

    If there is a tie for the mode, the modal value is the modal value that appears **first** in the
    DataFrame. 

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame over which to calcualate the mode. 
    gp_cols : list of str
        Columns to groupby for calculation of mode.
    value_col : str
        Column for which to calculate the mode. 

    Return
    ------ 
    pandas.DataFrame
        One row for the modal value per key_cols 
    """

    return ((df.groupby(gp_cols + [value_col], observed=True, sort=False).size() 
               .to_frame('mode_counts').reset_index() 
               .sort_values('mode_counts', ascending=False, kind='mergesort') 
               .drop_duplicates(subset=gp_cols))
             .reset_index(drop=True))

# Will always return the same one that occurs first in DataFrame
fast_mode(df, gp_cols=['account'], value_col='category')
#   account category  mode_counts
#0        3      dog            3
#1        2      cat            2
#2        4   rabbit            2
#3        1      cat            1

# Sampling allows you to select a "random one" in case of ties
fast_mode(df.sample(frac=1, random_state=12), gp_cols=['account'], value_col='category')
#   account category  mode_counts
#0        3      dog            3
#1        4      cat            2
#2        2      dog            2
#3        1      cat            1
ALollz
  • 57,915
  • 7
  • 66
  • 89