Get proportion of missing values per Country

Question

I would like to find the proportion of missing values across my features for each country, over all years, so that I can select countries.

I tried this:

df[indicators].isna().mean().sort_values(ascending=False)

but it only gives me the proportion of missing values for each indicator.

I would like this output:

jezrael · Accepted Answer · 2022-02-16T10:51:18.087

You can use DataFrame.melt to reshape the data to long format and then take the mean of the missing-value flags per country:

df1 = (df.melt(id_vars='Country Name', value_vars=indicators)
         .set_index('Country Name')['value'].isna()
         .groupby('Country Name')
         .mean()
         .reset_index(name='Prop'))
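As a quick check of the melt approach, here is a minimal sketch on made-up data. The sample frame, its two indicator columns, and the values are assumptions for illustration only; they are not the asker's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 2 countries x 2 years x 2 indicators (not the real data).
df = pd.DataFrame({
    "Country Name": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "Year": [2000, 2001, 2000, 2001],
    "GDP per capita": [np.nan, 120.0, 1100.0, 1200.0],
    "Internet users": [np.nan, np.nan, 0.1, np.nan],
})
indicators = ["GDP per capita", "Internet users"]

# Long format: one row per (country, indicator, year) cell.
long_df = df.melt(id_vars="Country Name", value_vars=indicators)

# The mean of a boolean "is missing" flag is the share of missing cells.
df1 = (long_df.set_index("Country Name")["value"].isna()
              .groupby("Country Name")
              .mean()
              .reset_index(name="Prop"))
print(df1)
```

Each country here has 4 cells (2 years x 2 indicators), so the proportions are taken over all indicators and all years at once rather than per indicator.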

Or reshape with DataFrame.stack:

# dropna=False keeps the NaN cells so that they can be counted
df1 = (df.set_index('Country Name')[indicators]
         .stack(dropna=False)
         .isna()
         .groupby('Country Name')
         .mean()
         .reset_index(name='Prop'))

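Or use a custom function:

import numpy as np

# axis=None averages over every cell of the country's block (rows x indicators)
df1 = (df.groupby('Country Name')[indicators]
         .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
         .reset_index(name='Prop'))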
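One way to sanity-check the result is to compute the proportion from explicit counts: the number of NaN cells per country divided by (rows for that country x number of indicators). The sketch below does that on made-up data and compares it with the custom-function approach; the sample frame and the names `n_missing`, `n_cells`, and `check` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 3 rows for Afghanistan, 2 for Albania, 2 indicators.
df = pd.DataFrame({
    "Country Name": ["Afghanistan"] * 3 + ["Albania"] * 2,
    "Year": [2000, 2001, 2002, 2000, 2001],
    "GDP per capita": [np.nan, 120.0, 130.0, 1100.0, 1200.0],
    "Internet users": [np.nan, np.nan, 0.5, 0.1, np.nan],
})
indicators = ["GDP per capita", "Internet users"]

# Explicit counts: NaN cells per country and total cells per country.
n_missing = df[indicators].isna().sum(axis=1).groupby(df["Country Name"]).sum()
n_cells = df.groupby("Country Name").size() * len(indicators)
check = (n_missing / n_cells).reset_index(name="Prop")

# Same quantity via the custom function applied to each country's block.
df1 = (df.groupby("Country Name")[indicators]
         .apply(lambda x: x.isna().to_numpy().mean())
         .reset_index(name="Prop"))

print(check)
print(np.allclose(check["Prop"], df1["Prop"]))
```

If the two results disagree on real data, comparing `n_missing` and `n_cells` for a single country is a simple way to see which count is off.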

edited Feb 16 '22 at 10:51

answered Feb 16 '22 at 09:35

jezrael


Thank you for your response, but the result is all NaN – Giordano Feb 16 '22 at 09:41
@Giordano - So does `print(df[indicators].isna().sum(axis=1))` return `NaN`s? – jezrael Feb 16 '22 at 09:50
It counts the number of NaNs in each row, and it's working! But I would like to display only the countries and not the years, with the proportion calculated over all years. – Giordano Feb 16 '22 at 09:59
@Giordano - I think I know the problem. You need `.groupby(['Country Name'], as_index=False)` instead of `.groupby(['Country Name', 'Year'], as_index=False)` – jezrael Feb 16 '22 at 10:01
For example, I have 19 rows for Afghanistan and 5 indicators, so 19 x 5 = 95 values. I have only 7 NaNs, so the proportion should be p = 7/95 = 0.07, and I get 0.36 now – Giordano Feb 16 '22 at 10:02
@Giordano - Are there multiple rows with the same year per country? Or is each year unique per country? – jezrael Feb 16 '22 at 10:03
I have only unique years per country – Giordano Feb 16 '22 at 10:05
@Giordano - Hmm, so why aggregate by country and year? That makes no sense. I think you need to aggregate by country only, because the countries are duplicated. The answer was edited. – jezrael Feb 16 '22 at 10:06
Thank you, that's it. I now need to add the columns together so that all indicator values are combined – Giordano Feb 16 '22 at 10:12
@Giordano - Can you add the expected output for the data sample in the question? I don't understand what you need. – jezrael Feb 16 '22 at 10:13
Question edited with desired output – Giordano Feb 16 '22 at 10:18
@Giordano - What is `indicators`? – jezrael Feb 16 '22 at 10:18
All 5 columns: Internet users (per 100 people); GDP per capita (current US$); Population, ages 15-24, total; Population of the official age for upper secondary education, both sexes (number); Population of the official age for tertiary education, both sexes (number) – Giordano Feb 16 '22 at 10:20
@Giordano - And how is `Albania` counted if it is not in the data sample in the question? Also, is `0.07` computed as 7/25? But it is not `0.07` for the data in the question – jezrael Feb 16 '22 at 10:22
Sorry, I made it by hand because I have all the countries in the world, and I calculated the proportion by hand from df.head() – Giordano Feb 16 '22 at 10:24
@Giordano - The answer was edited; is it working well now? – jezrael Feb 16 '22 at 10:25
I still get a proportion of 0.36 instead of 0.07. I don't get it; I need to check that again – Giordano Feb 16 '22 at 10:27
@Giordano - How many values are there for `Afghanistan`? And how many `NaN`s? – jezrael Feb 16 '22 at 10:29
95 values and 7 NaNs – Giordano Feb 16 '22 at 10:30
@Giordano - Solution was edited. – jezrael Feb 16 '22 at 10:44
Perfect ! Thank you very much ! – Giordano Feb 16 '22 at 10:50
