3

I have a dataframe that looks like this:

    Name    Gender  
0   John    0   
1   John    1   
2   Linda   1   
3   Lisa    0   
4   Lisa    1
5   Lisa    1   
6   Tom     0
7   Tom     1
8   John    0 

In this dataframe, a name such as John appears with both Gender values, 0 and 1. I want to:

  1. Count how often each name (e.g. John) appears with 0 and with 1
  2. Return a new dataframe in which each name (e.g. John) is paired with its most frequent Gender value
  3. If 0 and 1 have the same count, return 1

The returned dataframe should look like this:

    Name    Gender  
0   John    0       
1   Linda   1   
2   Lisa    1       
3   Tom     0

Is there a way to do this in pandas without using a for loop?

cs95
  • 3
    Note that Tom has to be 1 (according to point 3: if gender values 0 and 1 have the same count, return 1) – theletz Jul 05 '20 at 05:51

2 Answers

4

Just group on name and find the mode?

df.groupby('Name')['Gender'].agg(lambda x: x.mode().max())

Name
John     0
Linda    1
Lisa     1
Tom      1
Name: Gender, dtype: int64

"mode" is the "most frequently occurring value". If there are multiple modes, pd.Series.mode returns all of them, so we return the largest one.
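
To see why the .max() is needed, here is a minimal sketch. The dataframe construction is an assumption, rebuilt from the table in the question. It looks at Tom, who has one 0 and one 1, and then runs the same groupby as above:

```python
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({
    "Name": ["John", "John", "Linda", "Lisa", "Lisa", "Lisa", "Tom", "Tom", "John"],
    "Gender": [0, 1, 1, 0, 1, 1, 0, 1, 0],
})

# Tom has one 0 and one 1, so both values are modes and both are returned.
tom_modes = df.loc[df["Name"] == "Tom", "Gender"].mode()
print(tom_modes.tolist())

# Taking the max of the modes resolves a tie in favour of 1.
result = df.groupby("Name")["Gender"].agg(lambda x: x.mode().max())
print(result.to_dict())
```

For names without a tie, mode() returns a single value, so max() leaves it unchanged.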


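A similar approach avoids the lambda. (The level argument of max was removed in pandas 2.0, so this groups on the index level instead.)

df.groupby('Name')['Gender'].apply(pd.Series.mode).groupby(level=0).max()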

Name
John     0
Linda    1
Lisa     1
Tom      1
Name: Gender, dtype: int64
cs95
1

Because gender is binary, you can compute its mean for each name and check whether it is greater than or equal to 0.5:

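# Mean of a 0/1 column = fraction of ones for each name
new_df = df.groupby('Name')['Gender'].mean()
new_df = new_df.reset_index()
# At least as many ones as zeros -> 1, otherwise 0
new_df['Gender'] = (new_df['Gender'] >= 0.5).astype(int)
new_df


    Name    Gender
0   John    0
1   Linda   1
2   Lisa    1
3   Tom     1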

For each name, this computes the mean of the gender column. For example, John has [0, 1, 0], so his mean is 0.3333; if he had [1, 0, 1], it would be 0.6666.

If the mean is greater than 0.5, there are more ones than zeros; if it is less than 0.5, there are more zeros; and exactly 0.5 is a tie, which should return 1. That is exactly what the >= 0.5 comparison checks. Then we convert the result from boolean (True/False) to int (True becomes 1 and False becomes 0) using astype(int).
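
As a quick check of this reasoning, the sketch below prints each name's mean next to the resulting 0/1 value. The dataframe is rebuilt from the table in the question, and the names means and check are made up for illustration:

```python
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({
    "Name": ["John", "John", "Linda", "Lisa", "Lisa", "Lisa", "Tom", "Tom", "John"],
    "Gender": [0, 1, 1, 0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones in each group.
means = df.groupby("Name")["Gender"].mean()

# >= 0.5 means at least as many ones as zeros, so ties go to 1.
check = pd.DataFrame({"mean": means, "Gender": (means >= 0.5).astype(int)})
print(check)
```

John's mean is below 0.5, so he maps to 0, while Tom's mean is exactly 0.5, so the >= sends him to 1. Note that this shortcut only works because the column holds nothing but 0 and 1.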

theletz
  • Thank you, but what about a name with more 0s than 1s? In that case, I need to return 0. For example, John has two 0s and one 1. How do you make it return 0? Could you explain a bit? –  Jul 05 '20 at 15:09
  • @KeFeng I added an explanation. Let me know if it is clear now :) – theletz Jul 06 '20 at 07:39