How to overlap R histograms

Question

Reproduced from this code:

library(haven)
library(survey)
library(dplyr)

nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))

# Rename variables into something more readable
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA

nhanesAnalysis <- nhanesDemo %>%
  mutate(LowIncome = case_when(
    INDFMIN2 < 40 ~ T,
    T ~ F
  )) %>%
  # Select the necessary columns
  select(INDFMIN2, LowIncome, persWeight, psu, strata)

# Set up the design
nhanesDesign <- svydesign(id      = ~psu,
                          strata  = ~strata,
                          weights = ~persWeight,
                          nest    = TRUE,
                          data    = nhanesAnalysis)

svyhist(~log10(INDFMIN2), design=nhanesDesign, main = '')

How do I color the histogram by independent variable, say, LowIncome? I want to have two separate histograms, one for each value of LowIncome. Unfortunately I picked a bad example, but I want them to be see-through in case their values overlap.

Allan Cameron · Accepted Answer · 2023-02-15T10:08:18.150

If you want to plot a histogram from your model, you can get its data from model.frame (this is what svyhist does under the hood). To get the histogram filled by group, you could use this data frame inside ggplot:

library(ggplot2)

ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
  geom_histogram(alpha = 0.5, color = "gray60", breaks = 0:20 / 10) +
  theme_classic()

Edit

As Thomas Lumley points out, this does not incorporate sampling weights, so if you wanted this you could do:

ggplot(model.frame(nhanesDesign), aes(log10(INDFMIN2), fill = LowIncome)) +
  geom_histogram(aes(weight = persWeight), alpha = 0.5, 
                 color = "gray60", breaks = 0:20 / 10) +
  theme_classic()

To demonstrate this approach works, we can replicate Thomas's approach in ggplot using the data example from svyhist. To get the uneven bin sizes (if this is desired), we need two histogram layers, though I'm guessing this would not be required for most use-cases.

ggplot(model.frame(dstrat), aes(enroll)) +
  geom_histogram(aes(fill = "E", weight = pw, y = after_stat(density)),
                 data = subset(model.frame(dstrat), stype == "E"),
                 breaks = 0:35 * 100,
                 position = "identity", col = "gray50") +
  geom_histogram(aes(fill = "Not E", weight = pw, y = after_stat(density)),
                 data = subset(model.frame(dstrat), stype != "E"),
                 position = "identity", col = "gray50",
                 breaks = 0:7 * 500) +
  scale_fill_manual(NULL, values = c("#00880020", "#88000020")) +
  theme_classic()

The problem with this is that it doesn't use the sampling weights — Thomas Lumley, Feb 15 '23 at 01:14
@ThomasLumley good point. I have updated to show how to use the sampling weights within the ggplot framework, including a ggplot version of your nicely demonstrated example. Thank you. — Allan Cameron, Feb 15 '23 at 10:05

score 1 · Answer 2 · answered Feb 15 '23 at 01:47

You can't just extract the data and use ggplot, because that won't use the weights and so misses the whole point of svyhist. You can use the add=TRUE argument, though. You do need to set the x and y axis ranges correctly to make sure the whole plot is visible

Using the data example from ?svyhist

svyhist(~enroll, subset(dstrat,stype=="E"), col="#00880020",ylim=c(0,0.003),xlim=c(0,3500))
svyhist(~enroll, subset(dstrat,stype!="E"), col="#88000020",add=TRUE)

How to overlap R histograms

2 Answers2