barplot normalization and ordering of groups (x-axis)

Question

Below is the setup code I run before I attempt to draw a seaborn barplot.

import matplotlib.pyplot as plt 
import seaborn as sns 
import pandas as pd 
import statsmodels.api as sm 
import numpy as np

da = pd.read_csv("nhanes_2015_2016.csv")

da["DMDMARTL"] = da.DMDMARTL.fillna("Missing")
da["DMDMARTLdescript"] = da.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married",
                                              6: "Living with partner", 77: "Refused", 99: "Don't know"})

da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

da["agegrp"] = pd.cut(da.RIDAGEYR, [10, 20, 30, 40, 50, 60, 70, 80])

I pieced together bits of code here and there and arrived at what I have below.

y = "prop"
dx = da.loc[~da.RIAGENDRx.isin(["Male"]), :]
plt.figure(figsize=(12, 5))
prop_df = (dx["agegrp"]
       .groupby(dx["DMDMARTLdescript"])
       .value_counts(normalize=True)
       .rename(y)
       .reset_index())
sns.barplot(x="agegrp", y=y, hue="DMDMARTLdescript", data=prop_df)

Running the code above produces the following plot.

I have the following issues with the plot it generates.

1. Although I asked for the values to be normalized (`normalize=True`), the plot makes it fairly obvious that the bars in each age group sum to more than 1.
2. The age groups are ordered along the x-axis in a seemingly arbitrary way. I am not sure how to put them in numerical order.

(The CSV file is publicly available on GitHub.)

Concerning (1.) the normalization takes place according to the descript values. I.e. all "divorced" cases sum up to 1. — ImportanceOfBeingErnest, Jan 20 '19 at 21:35
Please print output of `print(prop_df.groupby(['DMDMARTLdescript']).sum())` and check. And please provide actual sample as we do not have your `.csv` file. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — Parfait, Jan 20 '19 at 22:46
@ImportanceOfBeingErnest thank you for your input. So I thought each age group is a divorced case and the sum of the bars in each age group would amount to 1. But I see that the red bar in [10,20] alone is already 1. — Blackwidow, Jan 21 '19 at 13:40
Yes, because the red bar is the only case of "Missing" so `sum("Missing")` is indeed 1. — ImportanceOfBeingErnest, Jan 21 '19 at 13:41
@ImportanceOfBeingErnest I ran `print(prop_df.groupby(['DMDMARTLdescript']).sum())` and I see what you mean. Is there a way I can make the sum of the bars in each age group normalized instead? — Blackwidow, Jan 21 '19 at 13:45
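
One possible way to get what the last comment asks for is to normalize within each age group instead of within each marital status. Below is a minimal sketch on a small synthetic frame, not the real CSV: the ages and statuses are made up, and the names `counts` and `age_order` are introduced here for illustration. It counts each (age group, status) pair and divides by the age-group total, then takes the bin order from the categories that `pd.cut` created.

```python
import pandas as pd

# Synthetic stand-in for the two NHANES columns used in the question
# (made-up values, only meant to illustrate the reshaping).
da = pd.DataFrame({
    "RIDAGEYR": [25, 27, 29, 35, 36, 38, 39, 65, 66],
    "DMDMARTLdescript": ["Married", "Never married", "Never married",
                         "Married", "Married", "Divorced", "Never married",
                         "Widowed", "Married"],
})
da["agegrp"] = pd.cut(da.RIDAGEYR, [10, 20, 30, 40, 50, 60, 70, 80])

# Count rows for every observed (age group, marital status) pair.
counts = (da.groupby(["agegrp", "DMDMARTLdescript"], observed=True)
            .size()
            .rename("n")
            .reset_index())

# Divide each count by the total of its age group, so the
# proportions sum to 1 within every age group.
group_totals = counts.groupby("agegrp", observed=True)["n"].transform("sum")
counts["prop"] = counts["n"] / group_totals
prop_df = counts.drop(columns="n")

# pd.cut returns an ordered categorical; its categories are already
# in numerical bin order and can be reused as the x-axis order.
age_order = list(da["agegrp"].cat.categories)

print(prop_df)
print(prop_df.groupby("agegrp", observed=True)["prop"].sum())
```

With `prop_df` built this way, the plot call from the question should only need the extra `order` argument, i.e. `sns.barplot(x="agegrp", y="prop", hue="DMDMARTLdescript", data=prop_df, order=age_order)`. This has not been run against the actual NHANES file, and the filter on `RIAGENDRx` from the question would still need to be applied first.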
