Replace A Column Value by Most Frequent Value In Group

Question

I have the following dataframe:

df = pd.DataFrame({'student': list('AAABBBCCCC'),
                   'city': ['LA', 'LA', 'NY', 'DC', 'NY', 'NY', 'SF', 'SF', 'LA', 'SF'],
                   'score': [75, 27, 31, 22, 43, 20, 26, 40, 33, 20]})

print(df)

  student city  score
0       A   LA     75
1       A   LA     27
2       A   NY     31
3       B   DC     22
4       B   NY     43
5       B   NY     20
6       C   SF     26
7       C   SF     40
8       C   LA     33
9       C   SF     20

I have to replace the city column based on which city appears most for that student. For example, here is the value_counts() of the city for each student:

df.groupby('student').city.value_counts()

student  city
A        LA      2
         NY      1
B        NY      2
         DC      1
C        SF      3
         LA      1

We can see, for the student A, LA appears most of the time. Hence, we want to replace other cities (here NY) with LA.

Desired output:

  student city  score
0       A   LA     75
1       A   LA     27
2       A   LA     31
3       B   NY     22
4       B   NY     43
5       B   NY     20
6       C   SF     26
7       C   SF     40
8       C   SF     33
9       C   SF     20

What would be the ideal way of getting the desired output? Any suggestions would be appreciated. Thanks!

score 1 · Accepted Answer · answered Aug 29 '22 at 18:33

You can get the most frequent value with mode, which is a bit faster than value_counts. Then you can use groupby().transform() to broadcast the values to all the rows:

# lambda x: x.value_counts().index[0] would also work
df['city'] = df.groupby('student')['city'].transform(lambda x: x.mode()[0])

Output:

  student city  score
0       A   LA     75
1       A   LA     27
2       A   LA     31
3       B   NY     22
4       B   NY     43
5       B   NY     20
6       C   SF     26
7       C   SF     40
8       C   SF     33
9       C   SF     20

Note mode() doesn't count NaN values, so if a student is missing city on all rows, mode() would throw an error.

Replace A Column Value by Most Frequent Value In Group

1 Answers1