0

Reproduced from this code:

library(haven)
library(survey)
library(dplyr)

nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))

# Rename variables into something more readable
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA

nhanesAnalysis <- nhanesDemo %>%
  mutate(LowIncome = case_when(
    INDFMIN2 < 40 ~ T,
    T ~ F
  )) %>%
  # Select the necessary columns
  select(INDFMIN2, LowIncome, persWeight, psu, strata)

# Set up the design
nhanesDesign <- svydesign(id      = ~psu,
                          strata  = ~strata,
                          weights = ~persWeight,
                          nest    = TRUE,
                          data    = nhanesAnalysis)

svyhist(~log10(INDFMIN2), design=nhanesDesign, main = '')

enter image description here

How do I color the histogram by independent variable, say, LowIncome? I want to have two separate histograms, one for each value of LowIncome. Unfortunately I picked a bad example, but I want them to be see-through in case their values overlap.

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
Forklift17
  • 2,245
  • 3
  • 20
  • 32

2 Answers2

3

If you want to plot a histogram from your model, you can get its data from model.frame (this is what svyhist does under the hood). To get the histogram filled by group, you could use this data frame inside ggplot:

library(ggplot2)

ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
  geom_histogram(alpha = 0.5, color = "gray60", breaks = 0:20 / 10) +
  theme_classic() 

enter image description here


Edit

As Thomas Lumley points out, this does not incorporate sampling weights, so if you wanted this you could do:

ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
  geom_histogram(aes(weight = persWeight), alpha = 0.5, 
                 color = "gray60", breaks = 0:20 / 10) +
  theme_classic() 

enter image description here

To demonstrate this approach works, we can replicate Thomas's approach in ggplot using the data example from svyhist. To get the uneven bin sizes (if this is desired), we need two histogram layers, though I'm guessing this would not be required for most use-cases.

ggplot(model.frame(dstrat), aes(enroll)) +
  geom_histogram(aes(fill = "E", weight = pw, y = after_stat(density)),
                 data = subset(model.frame(dstrat), stype == "E"),
                 breaks = 0:35 * 100,
                 position = "identity", col = "gray50") +
  geom_histogram(aes(fill = "Not E", weight = pw, y = after_stat(density)),
                 data = subset(model.frame(dstrat), stype != "E"),
                 position = "identity", col = "gray50",
                 breaks = 0:7 * 500) +
  scale_fill_manual(NULL, values = c("#00880020", "#88000020")) +
  theme_classic()

enter image description here

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • The problem with this is that it doesn't use the sampling weights – Thomas Lumley Feb 15 '23 at 01:14
  • @ThomasLumley good point. I have updated to show how to use the sampling weights within the ggplot framework, including a ggplot version of your nicely demonstrated example. Thank you. – Allan Cameron Feb 15 '23 at 10:05
1

You can't just extract the data and use ggplot, because that won't use the weights and so misses the whole point of svyhist. You can use the add=TRUE argument, though. You do need to set the x and y axis ranges correctly to make sure the whole plot is visible

Using the data example from ?svyhist

svyhist(~enroll, subset(dstrat,stype=="E"), col="#00880020",ylim=c(0,0.003),xlim=c(0,3500))
svyhist(~enroll, subset(dstrat,stype!="E"), col="#88000020",add=TRUE)

enter image description here

Thomas Lumley
  • 1,893
  • 5
  • 8