One name corresponding to two gender values, duplicate dataframe

Question

I have a dataframe that looks like this:

    Name    Gender  
0   John    0   
1   John    1   
2   Linda   1   
3   Lisa    0   
4   Lisa    1
5   Lisa    1   
6   Tom     0
7   Tom     1
8   John    0

In this dataframe, a name such as John corresponds to two gender values, 0 and 1. I want to:

1. Count how often each name (e.g. John) appears with gender 0 and with gender 1.
2. Return a new dataframe in which each name (e.g. John) is paired with its most frequent gender value.
3. If gender values 0 and 1 have the same count, return 1.

The returned dataframe should look like this:

    Name    Gender  
0   John    0       
1   Linda   1   
2   Lisa    1       
3   Tom     0

Is there pandas code that can do this without using a for loop?

Note that Tom has to be 1 (according to point 3: if gender values 0 and 1 have the same count, return 1) – theletz, Jul 05 '20 at 05:51

cs95 · Answer 1 · 2020-07-05T06:08:16.463

4

Just group on name and find the mode?

df.groupby('Name')['Gender'].agg(lambda x: x.mode().max())

Name
John     0
Linda    1
Lisa     1
Tom      1
Name: Gender, dtype: int64

"mode" is the "most frequently occurring value". If there are multiple modes, pd.Series.mode returns all of them, so we return the largest one.
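To make the tie-breaking concrete, here is a minimal sketch on two small series built from the question's rows for Tom and John (the variable names `tom` and `john` are illustrative only):

```python
import pandas as pd

# Tom appears once with 0 and once with 1, so both values are modes.
tom = pd.Series([0, 1])
print(tom.mode().tolist())  # [0, 1]
print(tom.mode().max())     # 1, so the tie resolves to 1

# John appears twice with 0 and once with 1, so there is a single mode.
john = pd.Series([0, 1, 0])
print(john.mode().tolist())  # [0]
```

Taking the maximum only matters when there is a tie; with a single mode it simply returns that value.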

A similar approach that avoids a lambda:

# Series.max(level=0) was removed in pandas 2.0; group on the first index level instead
df.groupby('Name')['Gender'].apply(pd.Series.mode).groupby(level=0).max()

Name
John     0
Linda    1
Lisa     1
Tom      1
Name: Gender, dtype: int64
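The snippets above return only the winning value. For the first item in the question (counting how often each name appears with each gender), one possible sketch uses `pd.crosstab`. The dataframe is rebuilt from the question's sample, and the names `counts` and `most_common` are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["John", "John", "Linda", "Lisa", "Lisa", "Lisa", "Tom", "Tom", "John"],
    "Gender": [0, 1, 1, 0, 1, 1, 0, 1, 0],
})

# One row per name, one column per gender value; cells hold the counts.
counts = pd.crosstab(df["Name"], df["Gender"])
print(counts)

# idxmax returns the first column holding the row maximum, so putting
# column 1 before column 0 makes a tie resolve to 1.
most_common = counts[[1, 0]].idxmax(axis=1)
print(most_common)
```

On this sample, `most_common` should agree with the mode-based result, including Tom resolving to 1.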

edited Jul 05 '20 at 06:08

answered Jul 05 '20 at 05:50

cs95


1

For people wondering what ``mode`` can do, https://stackoverflow.com/a/54304691/4985099 – sushanth Jul 05 '20 at 06:02
3

@Sushanth oh hey I think I know the guy who wrote that post. – cs95 Jul 05 '20 at 06:03

theletz · Answer 2 · 2020-07-06T07:38:10.273

1

Because gender is a binary value, you can calculate the average gender per name and check whether it is greater than or equal to 0.5:

new_df = df.groupby('Name')['Gender'].mean()
new_df = new_df.reset_index()
new_df['Gender'] = (new_df['Gender'] >= 0.5).astype(int)
new_df


    Name    Gender
0   John    0
1   Linda   1
2   Lisa    1
3   Tom     1

For each name, this calculates the average value: if John has [0,0,1] the average is 0.3333, whereas [1,0,1] would give 0.6666.

If the average is greater than 0.5, there are more ones than zeros, and if it is less than 0.5, there are more zeros than ones. An average of exactly 0.5 is a tie, which the >= comparison resolves to 1. That is exactly what new_df['Gender'] >= 0.5 checks. Then we convert the result from boolean (True/False) to int (True becomes 1 and False becomes 0) using astype(int).
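As a quick check of this reasoning, the sketch below rebuilds the question's sample with its original column names (`Name`, `Gender`) and compares the mean-threshold result with the mode-based approach from the other answer. The variable names `means`, `result`, and `by_mode` are illustrative assumptions:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["John", "John", "Linda", "Lisa", "Lisa", "Lisa", "Tom", "Tom", "John"],
    "Gender": [0, 1, 1, 0, 1, 1, 0, 1, 0],
})

# Per-name share of ones; for a 0/1 column this is the mean.
means = df.groupby("Name")["Gender"].mean()

# A share of at least 0.5 means ones win or tie, which maps to 1.
result = (means >= 0.5).astype(int).reset_index()
print(result)

# Cross-check against the largest mode per name.
by_mode = df.groupby("Name")["Gender"].agg(lambda x: x.mode().max())
print((result.set_index("Name")["Gender"] == by_mode).all())
```

Note that the threshold trick relies on the column holding only 0 and 1; with more than two categories the mode-based approach would be the one to use.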

edited Jul 06 '20 at 07:38

answered Jul 05 '20 at 05:50

theletz


Thank you, but what about a name with more 0s than 1s? In that case, I need to return 0. For example, John has two 0s and one 1. How did you make it return 0? Could you explain a bit? – Jul 05 '20 at 15:09
@KeFeng I added an explanation. Let me know if it is clear now :) – theletz Jul 06 '20 at 07:39
