Determining mean based on True/False in Pandas Dataframes

Question

I have got sample dataframe as below. And now trying to find % of Low income group in each country.

Output:

One solution would be :

(DF.Country[DF["Income"] == "Low"].value_counts() * 100 / DF.Country.value_counts())

Another is: Add third column in DF with True/False value against each record if Income='Low'. and then calculate mean.

Is there any better approach to achieve this?

Solution in R would be - tapply(DF$Income == "Low", DF$Country, mean) — Dharm, Sep 27 '20 at 20:52

Joe Ferndz · Answer 1 · 2020-09-27T20:30:26.673

1

You can use value_counts with normalize=True to get to your answer.

df['Income'].value_counts(normalize=True) * 100

If you want to break this down by Country, then you have to add that to the counts.

print (df[['Country','Income']].value_counts(normalize=True) * 100)

This will give me:

Country  Income 
M3       Low        28.571429
M2       VeryLow    14.285714
         Medium     14.285714
         High       14.285714
M1       Low        14.285714
         High       14.285714

Since you are looking for Mean of Income by each country, you have to group by Country.

print (df.groupby('Country')['Income'].value_counts(normalize=True) * 100)

This will give you:

Country  Income 
M1       High        50.000000
         Low         50.000000
M2       High        33.333333
         Medium      33.333333
         VeryLow     33.333333
M3       Low        100.000000

edited Sep 27 '20 at 20:30

answered Sep 27 '20 at 20:14

Joe Ferndz

8,417
2
13
33

Your answer would be true if i need to count mean for all records disregarding Country column. But here mean is required based on group by Country. – Dharm Sep 27 '20 at 20:17
Does the additional info help? – Joe Ferndz Sep 27 '20 at 20:20
Question: Do you want to know the mean compared to all the countries or mean compared to each country? For ex: `Low` for `M3`. It has two records in total. All of them are `Low`. So should `M3` represent `100.00` or `2/7` = `28.57 %` – Joe Ferndz Sep 27 '20 at 20:23
Thanks. But it won't help as value_count running over whole data frame. Somehow this needs to have group by "Income". As mentioned on my question, M3 should have 100%. – Dharm Sep 27 '20 at 20:23
Looks like you are trying to find the % of Income for each country. Let me tweak the answer a bit to give you that. – Joe Ferndz Sep 27 '20 at 20:25
Still one step left as i need only records having Income="Low". – Dharm Sep 27 '20 at 20:50

Grzegorz Skibinski · Answer 2 · 2020-09-27T20:29:39.350

0

You can do:

res = df.groupby("Country")["Income"].value_counts(normalize=True)

res = res.loc[res.index.get_level_values("Income") == "Low"].reset_index(level="Income", drop=True).mul(100).astype(int).astype(str) + " %"

edited Sep 27 '20 at 20:29

answered Sep 27 '20 at 20:24

Grzegorz Skibinski

12,624
2
11
34

Is there more pythonic way by chaining/Boolean indexing or something similar? R code solution for my problem would be tapply(DF$Income == "Low", DF$Country, mean) which looks more elegant. – Dharm Sep 27 '20 at 20:29
I think that's pretty pythonic - filtering of low income however bad looking I think is the fastest to do that way... – Grzegorz Skibinski Sep 27 '20 at 20:32
Elegant, not efficient - `.apply` generally is the slowest solution - see: https://stackoverflow.com/a/54432584/11610186 – Grzegorz Skibinski Sep 27 '20 at 20:34

Determining mean based on True/False in Pandas Dataframes

2 Answers2