I am new to R and programming. I need to plot a dummy variable as a fraction of age groups. I have created the dummy and completed a count. How do I create a fraction of dummy per age group x?

data


meps_2013<-
  meps_2013%>%
  select(dupersid,age13x,ipdis13,sex)%>%
  # inside mutate(), refer to columns by name rather than meps_2013$...
  mutate(hospsD = ifelse(ipdis13 >= 1 & ipdis13 <= 9, 1, 0))
meps_2013

# A tibble: 36,940 × 3
   dupersid age13x ipdis13
      <dbl>  <dbl>   <dbl>
 1 20004101     39       0
 2 20004102     40       0
 3 20004103     10       0
 4 20005101     52       0
 5 20005102     22       0
 6 20005103     19       0
 7 20006101     43       0
 8 20006102     42       0
 9 20006103     15       0
10 20006104     21       0
# … with 36,930 more rows

ipdis13 is variable used to create dummy.

Here is what I have:

Dummy variable: hospsD includes survey responses 1-9 equal to 1, 0 otherwise;

meps_2013<-
  meps_2013%>%
  select(dupersid,age13x,ipdis13,sex)%>%
  # inside mutate(), refer to columns by name rather than meps_2013$...
  mutate(hospsD = ifelse(ipdis13 >= 1 & ipdis13 <= 9, 1, 0))
meps_2013

no_hospD <- ifelse(meps_2013$ipdis13 == 0, 1, 0)

count(meps_2013, c("hospsD", "no_hospD"))
  hospsD no_hospD  freq
1      0        0     2
2      0        1 34694
3      1        0  2244

lm plot with error

summary_data <- meps_2013%>%
  group_by(age13x)%>%
  summarize(mean_hosps = mean(hospsD,na.rm=TRUE))
  
ggplot(summary_data, aes(x = age13x, y = mean_hosps)) +
  geom_smooth(method = "lm") +
  geom_point() +
  labs(x="Age", y="Hospitalizations")
summary_data

Error in FUN(X[[i]], ...) : object 'age13x' not found

setDT(meps_2013)[, .(Frac = sum(hospsD == 1, na.rm = TRUE)), by = age13x][, Frac := Frac/sum(Frac)][]
  • When you say "a fraction of dummy against x", what do you mean by "against x"? You can calculate the proportion of 1s with `mean(hospsD)` (the mean of a binary variable is the proportion of 1s), but I don't know what you mean by "against x". What is x? – Gregor Thomas Oct 06 '22 at 14:07
  • (Also not sure what this has to do with linear regression) – Gregor Thomas Oct 06 '22 at 15:00
  • Hi, Greg. Specifically, I need to plot the number of hospitalizations per age group as a fraction, with the hospitalizations as the dummy variable. So in this case fraction would be the mean? I will then run a regression on these variables. If I run it as is, I get length errors, “all arguments must be if the same length: y” – CarolineM84 Oct 06 '22 at 16:01
  • Assuming you have an age group column, you can pick your favorite method from the [FAQ on calculating mean by group](https://stackoverflow.com/q/11562656/903061) to get the proportion of hospitalizations by age group. (I'd recommend the `dplyr` method). If you need help running your regression, please share a few rows of sample data including all the relevant columns, as well as the code you tried that produced the error. Since your response variable is binary, I'd strongly recommend using logistic regression with `glm` instead of ordinary linear regression with `lm`. – Gregor Thomas Oct 06 '22 at 16:44
  • I will try this and come back. Thank you! – CarolineM84 Oct 06 '22 at 16:50
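
Following the suggestion in the comments to use logistic regression for a binary outcome, a minimal sketch might look like the code below. The data are simulated stand-ins (the `toy` data frame, the seed, and the coefficients used to generate it are all assumptions for illustration, not MEPS values); only the column names `age13x` and `hospsD` come from the question.

```r
# Simulated, hypothetical data so the sketch runs on its own.
set.seed(42)
toy <- data.frame(age13x = sample(0:85, 5000, replace = TRUE))
# Assumed relationship: hospitalization probability rises with age.
toy$hospsD <- rbinom(nrow(toy), 1, plogis(-4 + 0.03 * toy$age13x))

# Logistic regression on the individual-level 0/1 outcome.
fit <- glm(hospsD ~ age13x, data = toy, family = binomial)

# Predicted probability of hospitalization at a few ages.
new_ages <- data.frame(age13x = c(20, 40, 60, 80))
new_ages$p_hosp <- predict(fit, newdata = new_ages, type = "response")
new_ages
```

Fitting on the individual rows rather than on the per-age means keeps the predictions between 0 and 1, and every person contributes one observation, so no separate weighting step is needed.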

1 Answer

Add the binary flag to your data frame.

library(dplyr)
library(ggplot2)

meps_2013 <-
  meps_2013 %>%
  select(dupersid, age13x, ipdis13) %>%
  mutate(hospsD = ifelse(ipdis13 >= 1 & ipdis13 <= 9, 1, 0))

Then you can calculate the proportion by group:

summary_data <- meps_2013 %>%
  group_by(age13x) %>%
  ## use summarize to get 1 row per group
  summarize(mean_hosps = mean(hospsD,na.rm=TRUE))
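
To see why the mean of a 0/1 column gives the proportion per group, here is a small sketch on made-up data. The `toy` data frame and its values are hypothetical and only reuse the column names from the question.

```r
library(dplyr)

# Hypothetical toy data: three people in each of two age groups.
toy <- data.frame(
  age13x = c(20, 20, 20, 65, 65, 65),
  hospsD = c(0, 0, 1, 1, 1, 0)
)

toy_summary <- toy %>%
  group_by(age13x) %>%
  summarize(
    mean_hosps = mean(hospsD, na.rm = TRUE),  # share of 1s in the group
    n = n()                                   # number of people in the group
  )
toy_summary
```

In the toy data one of the three 20-year-olds was hospitalized, so that group's `mean_hosps` is 1/3, and two of the three 65-year-olds were, giving 2/3.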

And then plot the summarized data:

ggplot(summary_data, aes(x = age13x, y = mean_hosps))+
  geom_smooth(method = "lm") +
  geom_point()+
  labs(x = "Age", y = "Hospitalizations")

I can't really test because the data you've shared is all 0s, but I think this should work.

You could improve this by including the number of observations in each group in the summary (put n = n() inside summarize()) and then putting aes(weight = n) in the geom_smooth() layer, so that age groups with more people count for more in the fitted line.
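
A sketch of that weighted version, assuming simulated data in place of the real survey (the `toy` data frame, the seed, and the generating coefficients are hypothetical):

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for meps_2013 so the sketch runs on its own.
set.seed(1)
toy <- data.frame(age13x = sample(0:85, 2000, replace = TRUE))
toy$hospsD <- rbinom(nrow(toy), 1, plogis(-4 + 0.03 * toy$age13x))

# One row per age, keeping the group size alongside the proportion.
summary_data <- toy %>%
  group_by(age13x) %>%
  dplyr::summarize(mean_hosps = mean(hospsD, na.rm = TRUE), n = n())

# The weight aesthetic makes larger groups count for more in the lm fit.
p <- ggplot(summary_data, aes(x = age13x, y = mean_hosps)) +
  geom_smooth(aes(weight = n), method = "lm") +
  geom_point() +
  labs(x = "Age", y = "Proportion hospitalized")
p
```

Without the weights, an age with only a handful of respondents pulls on the line as hard as an age with hundreds.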

Gregor Thomas
  • Thank you! I was able to plot but am now receiving an error 'age13x' not found. Pulling my hair out. The initial plot reported a horizontal line at the 6 percent mean of hospitalizations across all age groups. Is this possible? Perhaps because it is an lm model? – CarolineM84 Oct 06 '22 at 21:33
  • Make sure your `age13x` column is numeric class, not `factor` class in the original data. None of this code should drop that column--if it disappears in the `summary_data` then perhaps you've accidentally loaded the old `plyr` package after `dplyr` and are inadvertently using `plyr::summarize` instead of `dplyr::summarize`. You can specify `dplyr::summarize` in the code to make sure you use the right version, or you can restart R and be careful not to load `plyr` (or any packages that load `plyr`). [See this FAQ for more explanation](https://stackoverflow.com/q/26106146/903061). – Gregor Thomas Oct 07 '22 at 13:43
  • Glad to hear! If this solved your problem please "accept" the answer by clicking the check mark in the left margin next to it. – Gregor Thomas Oct 27 '22 at 01:22
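
One way to check for the `plyr` masking issue mentioned in the comments is to ask R which package the bare name `summarize` currently resolves to, and to call `dplyr::summarize` explicitly. This is a sketch on hypothetical toy data; `summarize_source`, `toy`, and `by_age` are illustrative names.

```r
library(dplyr)

# Which package does the bare name `summarize` resolve to right now?
# If this reports "plyr", plyr was attached after dplyr and is masking it.
summarize_source <- environmentName(environment(summarize))
summarize_source

# Hypothetical toy data reusing the question's column names.
toy <- data.frame(age13x = c(20, 20, 65, 65), hospsD = c(0, 1, 1, 1))

# Naming the package explicitly sidesteps any masking.
by_age <- toy %>%
  group_by(age13x) %>%
  dplyr::summarize(mean_hosps = mean(hospsD))
names(by_age)
```

`plyr::summarize` does not respect `dplyr` grouping, so it returns a single overall mean with no `age13x` column, which would fit both the "object 'age13x' not found" error and the flat 6 percent line described above.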