
I need to find the mode/most common element of a pandas groupby object or an individual series, and for that I have the following function:

def get_most_common(srs):
    from collections import Counter
    import numpy as np

    x = list(srs)
    my_counter = Counter(x)
    if np.nan not in my_counter.keys():
        most_common_value = my_counter.most_common(1)[0][0]
    else:
        most_common_value = srs.mode(dropna=False).iloc[0]

    return most_common_value

In case of ties, I don't care which one is selected -- random is okay.

`Counter` is faster when there are no NaNs, but it gives wrong results when NaNs are present. `pd.Series.mode` is always correct, but it's slower than `Counter` when there is no NaN. So my function is a gamble: faster when there is no NaN, slower when there is one, because of the extra check `np.nan not in my_counter.keys()`. So far I get satisfactory performance with my large dataset, probably because there are many cases with no NaNs. But is there a way to make this any faster?
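
For context, here is a minimal, self-contained illustration (with made-up values, not my actual data) of why `Counter` miscounts NaNs pulled out of a float Series:

```python
import numpy as np
import pandas as pd
from collections import Counter

srs = pd.Series([np.nan, np.nan, np.nan, 1.0, 1.0])

# list(srs) boxes each float64 value into a fresh Python object; every
# boxed NaN is a distinct object that compares unequal to itself, so
# Counter stores three separate nan keys instead of one.
counts = Counter(list(srs))

print(len(counts))                      # 4 keys instead of 2
print(counts.most_common(1)[0])         # (1.0, 2) -- not nan with count 3
print(srs.mode(dropna=False).iloc[0])   # nan, the true mode
```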

irene
  • You really don't need the if/else and `Counter` in this case. Regardless of whether `np.nan` is in your `Series`, you are always asking for the most common value, which can be `np.nan`. Instead of defining `get_most_common`, just use `srs.mode(dropna=False)` and you'll get the same result. – r.ook Apr 08 '20 at 16:59
  • Can you fillna before doing the groupby with a value not in your original data (like the max + 1 or something similar) and then use only `Counter`? If this value is returned, you know it was nan. – Ben.T Apr 08 '20 at 17:03
  • @r.ook I tried using `srs.mode(dropna=False)` exclusively. It's much slower than what I did above. – irene Apr 08 '20 at 17:04
  • @Ben.T yes I tried doing something like a `srs.fillna('NAN')`, used `Counter`, and then if I got `'NAN'` as the mode, I would revert it to `np.nan`. I tried this on a sample Series with timeit (not on the large dataset) and it seems slower than my function above, even without a NaN in the Series. – irene Apr 08 '20 at 17:06
  • @irene If I understand correctly, `get_most_common` is the function you apply per group, so it would be faster to do this **before** the groupby: you do it once on the whole column, not each time in your function. – Ben.T Apr 08 '20 at 17:08
  • @Ben.T Hmm could be...if I apply this before the groupby it will be a one-time thing, would that be the case? – irene Apr 08 '20 at 17:10
  • @irene Indeed. Also, if your column is `float` or `int`, I would rather fill the missing value with the same type. I never tried timing it, but I assume if you fill with a string, your column becomes object type and the speed of operations on it may be affected. – Ben.T Apr 08 '20 at 17:12
  • 1
    @Ben.T, the series elements are strings. I found filling in with 0 is better than filling in with a string, but from timeit, performance seems to be equally slow for series with NaNs and series without NaNs. My function above works much better without NaNs (almost as fast as `Counter` alone), and it seems that's why it works okay with the large dataset. – irene Apr 08 '20 at 17:16
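
The fillna-before-the-groupby idea from the comments can be sketched like this (the sentinel value and the sample data are assumptions for illustration, not from the original dataset):

```python
import numpy as np
import pandas as pd
from collections import Counter

SENTINEL = "__NAN__"  # assumed placeholder known not to occur in the data

df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "val": ["x", np.nan, "y", "y", np.nan],
})

# Fill once, on the whole column, before the groupby -- so every group
# can take the fast Counter-only path.
filled = df.assign(val=df["val"].fillna(SENTINEL))

def get_most_common(srs):
    top = Counter(srs).most_common(1)[0][0]
    return np.nan if top == SENTINEL else top  # map sentinel back to NaN

modes = filled.groupby("key")["val"].apply(get_most_common)
print(modes["b"])  # 'y' (count 2 beats the single sentinel)
```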

1 Answer


I find it odd that you're getting better performance using Counter. Here's my test result (n=10000):

Using Series.mode on Series with nan: 52.41649858
Using Series.mode on Series without nan: 17.186453438
Using Counter on Series with nan: 269.33117825500005
Using Counter on Series without nan: 134.207576572

#-----------------------------------------------------#

             Series.mode  Counter
             -----------  -------------
With nan     52.42s       269.33s
Without nan  17.19s       134.21s

Test code:

import timeit

setup = '''
import pandas as pd
from collections import Counter

def get_most_common(srs):
    return srs.mode(dropna=False)[0]

def get_most_common_counter(srs):
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]

df = pd.read_csv(r'large.data')
'''

print(f"""Using Series.mode on Series with nan: {timeit.timeit('get_most_common(df["has_nan"])', setup=setup, number=10000)}""")
print(f"""Using Series.mode on Series without nan: {timeit.timeit('get_most_common(df["no_nan"])', setup=setup, number=10000)}""")
print(f"""Using Counter on Series with nan: {timeit.timeit('get_most_common_counter(df["has_nan"])', setup=setup, number=10000)}""")
print(f"""Using Counter on Series without nan: {timeit.timeit('get_most_common_counter(df["no_nan"])', setup=setup, number=10000)}""")

large.data is a DataFrame of 2 columns × 50,000 rows of random two-digit strings ('00' to '99'), where the has_nan column's mode is nan with a count of 551.


If anything, your `if np.nan not in my_counter.keys()` condition will always be True, because `np.nan` is never among `my_counter`'s keys. So in actuality you never used `pd.Series.mode`; it was always using `Counter`. As mentioned in the other question, because your pandas object already created copies of `np.nan` within the `Series`/`DataFrame`, the `in` condition will never be fulfilled. Give it a try:

np.nan in pd.Series([np.nan, 1, 2]).to_list()
# False

Remove the complexity of the if/else entirely and stick with one method, then compare the performance. As mentioned in your other question, a pandas method will almost always be the better approach over external modules/methods. If you are still observing otherwise, please update your question.

r.ook
  • Some confusing behavior here. Please check this snippet: `get_most_common(pd.Series([np.nan, np.nan, np.nan, "hello", "hello", "hi", "fat"]))`. On my machine, this goes through `pd.Series.mode` (I added a `print` statement for each branch) and I can verify it gives the correct result (nan). Please note that I'm working with series containing either strings or NaNs. Somehow the results differ when the series is composed of ints and NaNs. – irene Apr 09 '20 at 08:02
  • 1
    Seems like the reference for `nan` is determined by the `dtype`. For your given case, the `dtype` would be `'O'` mixed type so the reference seem to be kept. For `[np.nan, 1, 2]` it'll be `dtype='float64'` so the `nan` object is newly created. – r.ook Apr 09 '20 at 15:15
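
The dtype-dependent behavior described in this last comment can be checked directly; a small sketch:

```python
import numpy as np
import pandas as pd

# float64 dtype: each value is re-boxed into a new object on access, so
# the identity check behind `in` fails and nan != nan by definition.
print(np.nan in pd.Series([np.nan, 1, 2]).to_list())      # False

# object dtype ('O'): the stored element is the original np.nan object
# itself, so the identity check succeeds.
print(np.nan in pd.Series([np.nan, "a", "b"]).to_list())  # True
```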