
I am using the data found here: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system. In RStudio, I have loaded the CSV file under the name BRFSS2015. I have created two new columns to compare people who have arthritis with people who do not (arth and no_arth). Grouping by these variables, I am now trying to find the mean and sd of their weights. The weight variable was generated from another variable in the dataset using this code: (weight = BRFSS2015$WEIGHT2). Below is the code I am trying to run for the mean and sd.

BRFSS2015%>%
  group_by(arth,no_arth)%>%
  summarize(mean_weight=mean(weight),
            sd_weight=sd(weight))

I am getting output that says the mean and sd for these two groups are identical. I doubt this is correct. Can someone check and tell me why this is happening? The numbers I am getting are:

arth: mean = 733.2044; sd = 2197.377
no_arth: mean = 733.2044; sd = 2197.377

Here is how I created the variables arth and no_arth:

a=BRFSS2015%>%
  select(HAVARTH3)%>%
  filter(HAVARTH3=="1")
b=BRFSS2015%>%
  select(HAVARTH3)%>%
  filter(HAVARTH3=="2")

as.data.frame(BRFSS2015)
arth=c(a)
no_arth=c(b)
BRFSS2015$arth <- c(arth, rep(NA, nrow(BRFSS2015)-length(arth)))
BRFSS2015$no_arth <- c(no_arth, rep(NA, nrow(BRFSS2015)-length(no_arth)))
as.tibble(BRFSS2015)

Before I started, I also removed NAs from weight using weight=na.omit(WEIGHT2)

jakdar
  • Based on the code you provided, one can only guess. The most likely reason is that something went wrong when you created your `arth` and `no_arth` columns. For me, your code works fine when I add `%>% mutate(arth = HAVARTH3 == 1, no_arth = HAVARTH3 == 2, weight = WEIGHT2)` before group_by + summarise. – stefan Oct 09 '22 at 08:28
  • Do you mean create `arth` and `no_arth` in the way you are describing and just delete my version? Were you getting non-identical numbers? – jakdar Oct 09 '22 at 08:36
  • It's not necessary to delete your versions. They will be overwritten by the code I provided. And yes I get mean=735., sd=2203 for arth = TRUE and mean=727., sd=2185. for no_arth=TRUE – stefan Oct 09 '22 at 08:41
  • Okay, now I am using this code: `BRFSS2015%>% mutate(arth=HAVARTH3==1,no_arth=HAVARTH3==2)%>% group_by(arth,no_arth)%>% summarize(mean_weight=mean(WEIGHT2,na.rm=T), sd_weight=sd(WEIGHT2,na.rm=T))` For some reason, I am getting an output that is a 4x4 df that includes numbers for arth=FALSE and no_arth=FALSE. Why is this happening? – jakdar Oct 09 '22 at 08:54
  • Great. Also see my answer which, as a reference for future questions, shows how to create a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and how to create a small snippet of data. – stefan Oct 09 '22 at 09:02
  • You get a df with 4 rows because there are also values where HAVARTH3 is not 1 or 2. That's why in my answer I dropped these values and hence got only a df with two rows. Moreover after doing so it's actually sufficient to create (or group_by) just one of the columns. – stefan Oct 09 '22 at 09:05
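
The point made in the last comment can be sketched on a small example. The data below is hypothetical (made-up weights, with `7` and `NA` standing in for any HAVARTH3 value other than 1 or 2); it only illustrates where the extra groups come from and why grouping by one column is enough once those rows are dropped.

```r
library(dplyr)

# Hypothetical toy data, not taken from the real file.
toy <- tibble(
  HAVARTH3 = c(1, 1, 2, 2, 7, NA),
  WEIGHT2  = c(180, 160, 150, 170, 200, 190)
)

# Without filtering: rows coded 7 form a FALSE/FALSE group and
# rows with NA form an NA/NA group, so four groups come back.
all_groups <- toy %>%
  mutate(arth = HAVARTH3 == 1, no_arth = HAVARTH3 == 2) %>%
  group_by(arth, no_arth) %>%
  summarize(mean_weight = mean(WEIGHT2, na.rm = TRUE), .groups = "drop")

# Keeping only codes 1 and 2 leaves two groups, and a single
# logical column is sufficient to tell them apart.
two_groups <- toy %>%
  filter(HAVARTH3 %in% 1:2) %>%
  mutate(arth = HAVARTH3 == 1) %>%
  group_by(arth) %>%
  summarize(
    mean_weight = mean(WEIGHT2, na.rm = TRUE),
    sd_weight = sd(WEIGHT2, na.rm = TRUE),
    .groups = "drop"
  )

print(all_groups)
print(two_groups)
```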

1 Answer


Based on the info you provided, one can only guess what went wrong in your analysis. But here is working code using a snippet of the real data.

library(tidyverse)

BRFSS2015_minimal <- structure(list(HAVARTH3 = c(
  1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2,
  1, 1, 1, 1, 1, 1, 2, 1, 2
), WEIGHT2 = c(
  280, 165, 158, 180, 142,
  145, 148, 179, 84, 161, 175, 150, 9999, 140, 170, 128, 200, 178,
  155, 163
)), row.names = c(NA, -20L), class = c(
  "tbl_df", "tbl",
  "data.frame"
))

BRFSS2015_minimal %>%
  filter(!is.na(WEIGHT2), HAVARTH3 %in% 1:2) %>%
  mutate(arth = HAVARTH3 == 1, no_arth = HAVARTH3 == 2, weight = WEIGHT2) %>%
  group_by(arth, no_arth) %>%
  summarize(
    mean_weight = mean(weight),
    sd_weight = sd(weight),
    .groups = "drop"
  )
#> # A tibble: 2 × 4
#>   arth  no_arth mean_weight sd_weight
#>   <lgl> <lgl>         <dbl>     <dbl>
#> 1 FALSE TRUE            165      10.8
#> 2 TRUE  FALSE           865    2629.
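
One plausible explanation for the identical values in the question, offered as a guess rather than a confirmed diagnosis: if `weight` exists only as a standalone vector (as in `weight = BRFSS2015$WEIGHT2`) and not as a column of the data frame, `summarize()` falls back to that vector and computes the same overall mean and sd in every group. The sketch below uses hypothetical toy data to show the effect.

```r
library(dplyr)

# Hypothetical toy data, not taken from the real file.
toy <- tibble(
  HAVARTH3 = c(1, 1, 2, 2),
  WEIGHT2  = c(180, 160, 150, 170)
)

# A standalone vector in the calling environment, NOT a column of `toy`.
weight <- toy$WEIGHT2

# `toy` has no `weight` column, so mean(weight) and sd(weight) use the
# whole standalone vector and return the same numbers for every group.
same_everywhere <- toy %>%
  group_by(HAVARTH3) %>%
  summarize(mean_weight = mean(weight), sd_weight = sd(weight))

# Creating the column inside the pipeline makes the summary group-specific.
per_group <- toy %>%
  mutate(weight = WEIGHT2) %>%
  group_by(HAVARTH3) %>%
  summarize(mean_weight = mean(weight), sd_weight = sd(weight))

print(same_everywhere)
print(per_group)
```

This is consistent with the code above, where `weight` is created with `mutate()` before `group_by()`.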

Code used to create dataset

BRFSS2015 <- readr::read_csv("2015.csv")
 
BRFSS2015_minimal <- dput(head(BRFSS2015[c("HAVARTH3", "WEIGHT2")], 20))
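
A side note on the numbers: the 20-row snippet contains a WEIGHT2 value of 9999, which is what pushes the mean and sd of the arthritis group so high. The sketch below assumes that 7777 and 9999 are non-response codes rather than real weights. That assumption should be checked against the BRFSS codebook, which may define other special ranges as well (for example, weights reported in kilograms). The names `snippet`, `non_response` and `cleaned` are illustrative only.

```r
library(dplyr)

# The same 20 rows as the snippet used in this answer.
snippet <- tibble(
  HAVARTH3 = c(1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2),
  WEIGHT2 = c(280, 165, 158, 180, 142, 145, 148, 179, 84, 161, 175, 150,
              9999, 140, 170, 128, 200, 178, 155, 163)
)

# ASSUMPTION: these values mean "don't know" / "refused" in WEIGHT2.
# Verify against the codebook before using this on the real data.
non_response <- c(7777, 9999)

cleaned <- snippet %>%
  filter(HAVARTH3 %in% 1:2) %>%
  mutate(
    # Turn the assumed non-response codes into NA instead of keeping them as weights.
    weight = if_else(WEIGHT2 %in% non_response, NA_real_, WEIGHT2),
    arth = HAVARTH3 == 1
  ) %>%
  group_by(arth) %>%
  summarize(
    mean_weight = mean(weight, na.rm = TRUE),
    sd_weight = sd(weight, na.rm = TRUE),
    .groups = "drop"
  )

print(cleaned)
```
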
stefan