0

I have 2 columns in dataset, one is Country other is Gender , Gender has some NaN or N/A values. In pandas it look like so:

import pandas as pd    
Country = ['United Kingdom', 'Bosnia and Herzegovina', 'Thailand', 'United States', 'Ukraine', 'Canada', 'Ukraine', 'India', 'New Zealand', 'India', 'Antigua and Barbuda', 'Canada', 'United States', 'Germany', 'India', 'United Kingdom', 'Australia', 'Russian Federation', 'Brazil', 'Lithuania']
Gender = ['Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', nan, 'Man', 'Woman', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man']

I want to plot missing values as a proportion of all values of Gender in a Country category. I first use groupby on Country column which gave me only total count of gender within a country, what will be the best way to plot missing values as proportion of total gender value in a country:

EDIT: Answered this question in comments, please view solution, if you think the solution could be better please comment under the solution comment.

1 Answers1

0

Hi here is I created a solution for it,

  • first I created a similar dataframe that has all values in Gender = NaN,
  • then I replaced NaN with 1s
  • I grouped this dataframe by Country and summed Gender column count
  • then on the original data frame I again filled NaNs with 1s (This could be done in first step)
  • then grouped this dataframe by Country and summed Gender column
  • I joined these 2 dataframes on axis 1
  • I created a column that takes dataframe 1 Gender count (count of Gender = NaN per country), multilplies by 100 and divides by the second data frame Gender count
  • Then I plotted this new column as a bar plot

enter image description here

    Country = ['United Kingdom', 'Bosnia and Herzegovina', 'Thailand', 'United States', 'Ukraine', 'Canada', 'Ukraine', 'India', 'New Zealand', 'India', 'Antigua and Barbuda', 'Canada', 'United States', 'Germany', 'India', 'United Kingdom', 'Australia', 'Russian Federation', 'Brazil', 'Lithuania']
    Gender = ['Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', nan, 'Man', 'Woman', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man', 'Man']
    survey = pd.DataFrame({ 'Country' :Country , 'Gender':Gender})
    null_gender["Gender"].fillna(1, inplace = True) # filling 1 instead of NaN for the count in next line of code
    freq = null_gender.groupby('Country')['Gender'].count().rename('NullCount') # Number of NaN in Gender Column per country

    survey["Gender"].fillna(1, inplace = True)
    freq_full =survey.groupby('Country')['Gender'].count().rename('Totalcount')
    total_freq = pd.concat([freq, freq_full], axis=1)
    total_freq = total_freq.dropna(how='any',axis=0) 
    total_freq['null_percent'] = total_freq.apply(lambda row: round((row.NullCount*100) /row.Totalcount, 2) , axis = 1) 
    total_freq.sort_values('null_percent', ascending=True, inplace=True)
    total_freq['null_percent'].plot(kind='barh', figsize=(10,25))