
Below is the setup I needed before getting to the part where I try to draw a seaborn barplot.

import matplotlib.pyplot as plt 
import seaborn as sns 
import pandas as pd 
import statsmodels.api as sm 
import numpy as np

da = pd.read_csv("nhanes_2015_2016.csv")

da["DMDMARTL"] = da.DMDMARTL.fillna("Missing")
da["DMDMARTLdescript"] = da.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married", 
                             6: "Living with partner",       77: "Refused", 99: "Don't know"})

da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

da["agegrp"] = pd.cut(da.RIDAGEYR, [10, 20, 30, 40, 50, 60, 70, 80])

I pieced together bits of code here and there and arrived at what I have below.

y = "prop"
dx = da.loc[~da.RIAGENDRx.isin(["Male"]), :]
plt.figure(figsize=(12, 5))
prop_df = (dx["agegrp"]
       .groupby(dx["DMDMARTLdescript"])
       .value_counts(normalize=True)
       .rename(y)
       .reset_index())
sns.barplot(x="agegrp", y=y, hue="DMDMARTLdescript", data=prop_df)

Running the code above produces the following plot:

Image

I have the following issues with the plot it generates.

  1. Although I asked for each age group to be normalized (`normalize=True`), it is fairly obvious from the image that the bars in each age group sum to more than 1.

  2. The age groups are ordered along the x axis in a somewhat arbitrary way. I am not sure how to put them in numerical order.

(The CSV file is publicly available here: github link.)

Blackwidow
  • Concerning (1.) the normalization takes place according to the descript values. I.e. all "divorced" cases sum up to 1. – ImportanceOfBeingErnest Jan 20 '19 at 21:35
  • Please print output of `print(prop_df.groupby(['DMDMARTLdescript']).sum())` and check. And please provide actual sample as we do not have your `.csv` file. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Parfait Jan 20 '19 at 22:46
  • @ImportanceOfBeingErnest thank you for your input. So I thought each age group is a divorced case and the sum of the bars in each age group would amount to 1. But I see that the red bar in [10,20] alone is already 1. – Blackwidow Jan 21 '19 at 13:40
  • Yes, because the red bar is the only case of "Missing" so `sum("Missing")` is indeed 1. – ImportanceOfBeingErnest Jan 21 '19 at 13:41
  • @ImportanceOfBeingErnest I ran `print(prop_df.groupby(['DMDMARTLdescript']).sum())` and I see what you mean. Is there a way I can make the sum of the bars in each age group normalized instead? – Blackwidow Jan 21 '19 at 13:45
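
Following the comments above, one possible way to normalize within each age group instead is to count each (age group, marital status) pair and divide by the total for that age group. This is only a minimal sketch: the `toy` frame and its values are made up to stand in for the NHANES columns, and `age_order` is a hypothetical helper name, not something from the original code.

```python
import pandas as pd

# Made-up toy data standing in for the NHANES columns used in the question.
toy = pd.DataFrame({
    "RIDAGEYR": [25, 27, 29, 35, 36, 38, 39, 45],
    "DMDMARTLdescript": ["Married", "Never married", "Never married",
                         "Married", "Married", "Divorced", "Never married",
                         "Married"],
})
toy["agegrp"] = pd.cut(toy.RIDAGEYR, [20, 30, 40, 50])

# Count each (age group, marital status) combination that actually occurs.
prop_df = (toy.groupby(["agegrp", "DMDMARTLdescript"], observed=True)
              .size()
              .rename("n")
              .reset_index())

# Divide by the total of the age group, so proportions sum to 1 per age group
# rather than per marital status.
group_totals = prop_df.groupby("agegrp", observed=True)["n"].transform("sum")
prop_df["prop"] = prop_df["n"] / group_totals

# pd.cut yields ordered categories; listing them gives the numerical bin order.
age_order = list(toy["agegrp"].cat.categories)

print(prop_df.groupby("agegrp", observed=True)["prop"].sum())

# The frame could then be passed to seaborn with an explicit x-axis order, e.g.
# sns.barplot(x="agegrp", y="prop", hue="DMDMARTLdescript",
#             data=prop_df, order=age_order)
```

With the real data, `toy` would be replaced by `dx` and the bin edges by those in the question. Passing `order=` explicitly is intended to address the second issue as well, since it no longer relies on the row order that `value_counts` produces.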

0 Answers