1

I have a dataframe that, as a result of a previous group by, contains 5 rows and two columns. column A is a unique name, and column B contains a list of unique numbers that correspond to different factors related to the unique name. How can I find the most common number (mode) for each row?

df = pd.DataFrame({"A": [Name1,Name2,...], "B": [[3, 5, 6, 6], [1, 1, 1, 4],...]})

I have tried:

df['C'] = df[['B']].mode(axis=1)

but this simply creates a copy of the lists from column B. Not really sure how to access each list in this case.

Result should be:

  A:        B:       C:
Name 1  [3,5,6,6]    6
Name 2  [1,1,1,4]    1

Any help would be great.

Jasper_97
  • 27
  • 4

2 Answers2

1

I would use Pandas' .apply() function here. It will execute a function on each element in a series. First, we define the function, I'm taking the mode from Find the most common element in a list

def mode(lst):
    return max(set(lst), key=lst.count)

Then, we apply this function to the B column to get C:

df['C'] = df['B'].apply(mode)

Our output is:

>>> df
       A             B  C
0  Name1  [3, 5, 6, 6]  6
1  Name2  [1, 1, 1, 4]  1
SNygard
  • 916
  • 1
  • 9
  • 21
  • If the lists were larger than my example, would you suggest anything to avoid apply, or speed up the process? Thank you also for your reply. – Jasper_97 Aug 18 '22 at 13:10
  • Thank you again for the suggestion. I just wanted to note that the answer provided by jros is significantly faster than the custom mode function. Timings on lists of length 4+ million were 5 seconds (jros method), and 1 minute 5 seconds (custom mode function). – Jasper_97 Aug 19 '22 at 14:43
  • Very interesting! Thanks for the update. I'll leave my answer since some logic is more easily adapted to the `.apply()` method. In general, it's better to use the well-established functions (like jros' method) if they're available. – SNygard Aug 19 '22 at 21:53
1

Here's a method using statistics module's mode function

from statistics import mode

Two options:

df["C"] = df["B"].apply(mode)
df.head()
#   A        B              C
# 0 Name1   [3, 5, 6, 6]    6
# 1 Name2   [1, 1, 1, 4]    1

Or

df["C"] = [mode(df["B"][i]) for i in range(len(df))]
df.head()
#   A        B              C
# 0 Name1   [3, 5, 6, 6]    6
# 1 Name2   [1, 1, 1, 4]    1
jros
  • 387
  • 2
  • 15
  • Thanks for the reply. Would you have any recommendations if the lists were a lot larger than my example? To cut back on time to iterate through each list. – Jasper_97 Aug 18 '22 at 13:26
  • @Jasper_97 how large? Have you checked out this page already for some tips? https://pandas.pydata.org/docs/user_guide/scale.html – jros Aug 18 '22 at 14:12
  • Some of the row lists contain hundreds of thousands of values. – Jasper_97 Aug 18 '22 at 14:25
  • 1
    I ran a test of mode() on a list with length 500,000 and it took 0.0254 seconds, so depending on how many rows in your dataset that will obviously grow – jros Aug 18 '22 at 14:34
  • Would your approach be the same for a list of say 5,000,000? Just thinking in terms of scalability of the apply method, as my list lengths could vary depending on my initial group by parameters etc. – Jasper_97 Aug 18 '22 at 14:42
  • 1
    Again, it depends on how many times you're finding the mode of such long lists. I tested a single list of length of 5million on my mac and it took 0.21 seconds. The most popular method I've used for large quantities of lists is converting to numpy arrays then doing similar operation – jros Aug 18 '22 at 14:52
  • Given the built in function returns only the mode. How would you suggest getting say: top 2 numbers from a list, or top 5, with the above methods? – Jasper_97 Aug 19 '22 at 14:56