How to replace values of a pandas column with the most frequent value

Question

I have tried several answers, but none of them solved my problem. I also looked at this answer, but it didn't work either. Here is my dataframe:

import numpy as np
import pandas as pd

np.random.seed(2)
col1 = np.random.choice([1,2,3], size=(50))
col2 = np.random.choice([1,2,3,4], size=(50))
col3 = np.random.choice(['a', 'b', 'c', 'd', 'e'], size=(50))
data = {'col1':col1, 'col2':col2, 'col3':col3}
df = pd.DataFrame(data)

I want to

1) perform a groupby on the col1 and col2 columns (shown as c1 and c2 below) and

2) create a new column (c4) that holds the most frequent value of the col3 column (c3) within each group.

The final df should look like this:

    c1  c2  c3  c4
0   1   1   b   b
1   1   1   b   b
2   1   2   a   b
3   1   2   b   b
4   1   2   b   b
5   1   2   b   b
6   1   2   c   b
7   1   3   a   a
8   1   3   c   a
9   1   3   b   a
10  1   3   c   a
11  1   3   a   a
12  1   3   b   a
13  1   3   a   a
14  1   3   a   a
15  1   3   c   a
16  1   4   a   a
17  2   1   c   c
18  2   1   c   c
19  2   1   a   c
20  2   1   c   c
21  2   1   c   c
22  2   1   b   c
23  2   2   a   a
24  2   2   c   a
25  2   2   a   a
26  2   3   a   a
27  2   3   a   a
28  2   4   c   c
29  2   4   c   c
30  3   1   b   a
31  3   1   a   a
32  3   1   a   a
33  3   1   c   a
34  3   1   b   a
35  3   2   c   c
36  3   2   c   c
37  3   2   b   c
38  3   2   a   c
39  3   2   c   c
40  3   3   b   b
41  3   3   a   b
42  3   3   b   b
43  3   3   c   b
44  3   3   a   b
45  3   3   b   b
46  3   3   b   b
47  3   3   c   b
48  3   4   b   b
49  3   4   c   c

For example I used this code without any success:

df1 = df.groupby(['c1', 'c2'])['c3'].agg(lambda x:x.value_counts().index[0])

Try: `df['col4'] = df.groupby(['col1', 'col2'])['col3'].transform(pd.Series.mode)` — Erfan, Jul 09 '19 at 20:47
It gives an error: ValueError: Wrong number of items passed 2, placement implies 5 — Muser, Jul 09 '19 at 20:53
Restart your kernel and try again, because this should work and also works for your sample data. — Erfan, Jul 09 '19 at 20:54
Did you copy the code I provided exactly? That is, `.transform(pd.Series.mode)` without the `()` after `Series.mode`? Otherwise I think you didn't describe your problem correctly. — Erfan, Jul 09 '19 at 20:57
@Erfan it doesn't work (try the 40-row data) if a group has more than one mode; you need to choose just one for `transform` to work. — Quang Hoang, Jul 09 '19 at 21:05

score 1 · Answer 1 · Quang Hoang · 2019-07-09T20:50:56.567

You want idxmax:

df['col4'] = df.groupby(['col1', 'col2']).col3.transform(lambda x: x.value_counts().idxmax())

Sample data:

np.random.seed(2)
col1 = np.random.choice([1,2,3], size=(10))
col2 = np.random.choice([1,2,3,4], size=(10))
col3 = np.random.choice(['a', 'b', 'c', 'd', 'e'], size=(10))
data = {'col1':col1, 'col2':col2, 'col3':col3}
df = pd.DataFrame(data)

gives:

   col1  col2 col3 col4
0     1     1    d    b
1     2     1    c    c
2     1     1    b    b
3     3     2    c    c
4     3     4    e    b
5     1     4    d    d
6     3     3    a    a
7     2     1    e    c
8     2     3    d    d
9     3     4    b    b
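
To see what the lambda does per group, here is a minimal sketch on a small hand-made frame (the data and the name `toy` are illustrative assumptions, not taken from the question). `value_counts()` counts each value in the group, `idxmax()` returns the label with the highest count, and `transform` broadcasts that single label back to every row of the group.

```python
import pandas as pd

# Hypothetical toy data: two groups, each with one clear winner (no ties)
toy = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 1, 1, 1, 1, 1],
    'col3': ['a', 'b', 'b', 'c', 'c', 'a'],
})

# One label per group, repeated for each row of that group
toy['col4'] = (
    toy.groupby(['col1', 'col2'])['col3']
       .transform(lambda x: x.value_counts().idxmax())
)
print(toy)
```

When two values tie, which one `idxmax()` picks depends on the order in which `value_counts()` returns them, so it is probably safer not to rely on this approach for a specific tie-breaking rule.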

edited Jul 09 '19 at 20:50

answered Jul 09 '19 at 20:46

Quang Hoang


I think this doesn't work. – Muser Jul 09 '19 at 20:48
Did you try it? – Quang Hoang Jul 09 '19 at 20:52
Yes, but it doesn't give me what I wanted. – Muser Jul 09 '19 at 20:54
If you increase the sample size to 30, you can clearly see that this is not the case. – Muser Jul 09 '19 at 20:57
I'm not sure I follow. This should work, with one caveat: it always selects the lesser value, i.e., if `a` and `b` have the same count, it chooses `a`, while your sample data gives `b`. But that's something you should mention if it is really important. – Quang Hoang Jul 09 '19 at 21:01
I couldn't produce the result that I wanted with this code. – Muser Jul 09 '19 at 21:09
Can you elaborate? What exactly is the difference? – Quang Hoang Jul 09 '19 at 21:10
The difference is not distinctly visible in a dataset with 10 samples. I used your code on my dataset (provided in the question) and didn't get the `c4` column that I wanted. – Muser Jul 09 '19 at 21:15

score 1 · Accepted Answer · answered Jul 09 '19 at 21:04

The reason `.transform(pd.Series.mode)` didn't work is that `mode` returns every mode, so a group with two modes yields two values instead of one. We can solve this by taking the first of them:

df['c4'] = df.groupby(['c1', 'c2'])['c3'].transform(lambda x: x.mode()[0])

Or

df['c4'] = df.groupby(['c1', 'c2'])['c3'].transform(lambda x: pd.Series.mode(x)[0])

    c1  c2 c3 c4
0    1   1  b  b
1    1   1  b  b
2    1   2  a  b
3    1   2  b  b
4    1   2  b  b
5    1   2  b  b
6    1   2  c  b
7    1   3  a  a
8    1   3  c  a
9    1   3  b  a
10   1   3  c  a
11   1   3  a  a
12   1   3  b  a
13   1   3  a  a
14   1   3  a  a
15   1   3  c  a
16   1   4  a  a
17   2   1  c  c
18   2   1  c  c
19   2   1  a  c
20   2   1  c  c
21   2   1  c  c
22   2   1  b  c
23   2   2  a  a
24   2   2  c  a
25   2   2  a  a
26   2   3  a  a
27   2   3  a  a
28   2   4  c  c
29   2   4  c  c
30   3   1  b  a
31   3   1  a  a
32   3   1  a  a
33   3   1  c  a
34   3   1  b  a
35   3   2  c  c
36   3   2  c  c
37   3   2  b  c
38   3   2  a  c
39   3   2  c  c
40   3   3  b  b
41   3   3  a  b
42   3   3  b  b
43   3   3  c  b
44   3   3  a  b
45   3   3  b  b
46   3   3  b  b
47   3   3  c  b
48   3   4  b  b
49   3   4  c  b
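
A minimal sketch of why the tie matters, on hand-made data (the frame `toy` and its values are assumptions for illustration). `Series.mode()` returns a Series holding every mode, so a tied group yields two values, which is consistent with the ValueError reported in the comments; taking the first element reduces it to one value per group. `.iloc[0]` is used here as a position-based equivalent of `[0]`.

```python
import pandas as pd

# Hypothetical toy data: group (1, 2) has a clear mode, group (3, 4) has a tie
toy = pd.DataFrame({
    'c1': [1, 1, 1, 3, 3],
    'c2': [2, 2, 2, 4, 4],
    'c3': ['b', 'b', 'a', 'c', 'b'],
})

# In group (3, 4), 'b' and 'c' each appear once, so mode() returns both
tied = toy.loc[toy['c1'] == 3, 'c3'].mode()
print(list(tied))

# Keep only the first mode so that transform gets one value per group
toy['c4'] = toy.groupby(['c1', 'c2'])['c3'].transform(lambda x: x.mode().iloc[0])
print(toy)
```

Because `mode()` returns its values sorted, the first mode of a tied group is the smallest one, which matches the `b` shown for rows 48 and 49 above.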

That's it. This is exactly what I wanted. Thank you very much. — Muser, Jul 09 '19 at 21:11

score 0 · Answer 3 · answered Jul 09 '19 at 20:49


You could try finding the mode in each group and then merging it back to the set.

modes = df.groupby(['col1', 'col2'])['col3'].apply(pd.Series.mode)
df = df.merge(modes, on=['col1', 'col2'], how='left')
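
For the merge to behave as intended, the right-hand side probably needs to be a DataFrame with one row per group and the group keys as columns. A hedged sketch, assuming ties are broken by taking the first mode and that the new column is called `col4` (both are assumptions, not part of the answer above); the small frame `toy` is made up for illustration:

```python
import pandas as pd

# Hypothetical toy data with two (col1, col2) groups
toy = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2],
    'col2': [1, 1, 1, 3, 3],
    'col3': ['a', 'b', 'b', 'd', 'd'],
})

# One mode per (col1, col2) group, returned as a DataFrame with key columns
modes = (
    toy.groupby(['col1', 'col2'])['col3']
       .agg(lambda x: x.mode().iloc[0])  # first mode only (assumed tie rule)
       .rename('col4')
       .reset_index()
)

# Left merge keeps every original row and attaches the group's mode
merged = toy.merge(modes, on=['col1', 'col2'], how='left')
print(merged)
```

Reducing each group to a single value before merging keeps the row count unchanged; with several modes per group, the merge would otherwise duplicate rows.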

answered Jul 09 '19 at 20:49

ifly6


It gives this error: ValueError: can not merge DataFrame with instance of type – Muser Jul 09 '19 at 21:20
In your console, type in `pd.__version__` – ifly6 Jul 09 '19 at 21:21
My pandas version is: '0.22.0' – Muser Jul 09 '19 at 21:22
I think we should correct it the way Erfan did. – Muser Jul 09 '19 at 21:23
