I am new to R and am struggling to find code that age-matches 10 controls per case. All cases and controls are in one data frame and are labelled 'Case' or 'Control' in a 'Group' column. I want to create a new data frame containing the cases and their age-matched controls, where each control is within +/- 2 years of the age of its case. My data frame has 112 cases and 4910 controls, which should be enough. Below is a small part of my data frame; please let me know if the sample is too small:
dat <- structure(list(Sex = c("F", "M", "F", "M", "M", "M", "F", "F",
"F", "F", "M", "M", "M", "M", "F", "F", "M", "M", "F", "F", "F",
"F", "M", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "F",
"F", "M", "F", "F", "M", "F", "F", "M", "M", "M", "F", "M", "F",
"F", "M", "F", "M", "M", "M", "M", "F", "F", "M", "F", "M", "F",
"M", "M", "F", "F", "F", "M", "F", "F", "F", "M", "M", "F", "M",
"M", "M", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F", "M",
"F", "M", "M", "F", "F", "F", "M", "M", "M", "F", "F", "F", "M",
"M", "F", "F", "F", "M", "F", "F", "M", "M", "M", "F", "F", "F",
"M", "F"), mcv = c(89, 90, 86, 87, 90, 88, 85, 90, 92, 89, 87,
95, 92, 94, 89, 87, 93, 90, 96, 94, 88, 101, 83, 97, 79, 91,
92, 89, 90, 93, 88, 94, 92, 89, 97, 98, 80, 92, 87, 95, 85, 91,
89, 89, 94, 77, 92, 92, 82, 92, 85, 105, 96, 102, 89, 87, 87,
95, 93, 88, 93, 82, 88, 86, 87, 88, 89, 89, 91, 90, 90, 85, 95,
88, 91, 88, 87, 92, 91, 92, 92, 80, 80, 96, 85, 90, 88, 89, 86,
91, 91, 76, 94, 86, 94, 84, 88, 92, 101, 91, 93, 98, 98, 91,
86, 84, 91, 90, 88, 88, 83, 91, NA, 101), Age = c(52, 63, 72,
52, 66, 59, 51, 63, 68, 53, 64, 70, 70, 78, 59, 55, 54, 54, 83,
61, 51, 72, 57, 67, 72, 52, 55, 52, 95, 79, 60, 61, 73, 69, 65,
55, 53, 77, 79, 54, 64, 54, 65, 71, 63, 52, 54, 63, 69, 70, 56,
80, 54, 67, 59, 71, 56, 73, 53, 61, 71, 73, 74, 63, 82, 60, 52,
65, 75, 66, 74, 71, 58, 52, 53, 55, 91, 73, 62, 51, 74, 73, 64,
60, 58, 63, 63, 59, 72, 52, 85, 51, 61, 56, 60, 64, 73, 78, 57,
52, 62, 64, 70, 62, 58, 69, 84, 72, 71, 63, 73, 63.3, 62.3, 59.56
), Group = c("Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Control", "Control",
"Control", "Control", "Control", "Control", "Case", "Case", "Case"
)), row.names = c(NA, -114L), class = c("tbl_df", "tbl", "data.frame"
))
I have tried code from other questions, but it doesn't work:
Matching controls to cases using multiple conditions in r
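library(dplyr, warn.conflicts = FALSE)
# Split by group; this creates data frames `Case` and `Control` in the global environment
dat %>%
  split(.$Group) %>%
  list2env(envir = .GlobalEnv)
# Flag marking controls that have already been matched
Control$FILTER <- FALSE
Control
set.seed(123)
for (i in seq_len(nrow(Case))) {
  # Unmatched controls whose age is within 2 years of this case
  x <- which(between(Control$Age, Case$Age[i] - 2, Case$Age[i] + 2) &
               !Control$FILTER)
  # Index into x: sample(x, ...) would draw from 1:x when length(x) == 1
  Control$FILTER[x[sample.int(length(x), min(10, length(x)))]] <- TRUE
}
Control
# Keep all cases (FILTER is NA) plus the flagged controls
bind_rows(Case, Control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)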
The result of the code above was missing 30 controls.
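One possible explanation for the shortfall (an assumption, not verified on the full data) is that the loop takes `min(10, length(x))` controls, so any case with fewer than 10 unused controls inside its age window silently receives fewer. A minimal sketch of a check for this, using a small hypothetical data frame `toy` rather than the real data:

```r
# Toy data with hypothetical values; column names follow the real data
toy <- data.frame(
  Age   = c(60, 61, 62, 63, 64, 70, 90, 62, 91),
  Group = c(rep("Control", 7), "Case", "Case")
)
cases    <- toy[toy$Group == "Case", ]
controls <- toy[toy$Group == "Control", ]

# For each case, count the controls whose age lies within +/- 2 years
n_eligible <- sapply(cases$Age, function(a) sum(abs(controls$Age - a) <= 2))

# Cases with fewer than 10 eligible controls cannot be fully matched
data.frame(case_age = cases$Age, n_eligible = n_eligible)
```

This only counts candidates per case independently. Because each control can be used once, cases of similar age also compete for the same controls, so the real shortfall may be larger than this count suggests.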
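library(dplyr)
library(purrr)  # provides map_df()
case_data <- dat %>% filter(Group == "Case")
control_data <- dat %>% filter(Group == "Control")
# For each case row, append up to 10 randomly sampled controls within 2 years of its age
case_data %>%
  group_split(row_number(), .keep = FALSE) %>%
  map_df(~ bind_rows(.x, control_data %>%
                       filter(between(Age, .x$Age - 2, .x$Age + 2)) %>%
                       slice_sample(n = 10)))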
The code above produced an error:
Error in `slice_sample()`:
! Problem while computing indices.
Caused by error in `sample.int()`:
! invalid first argument
My expected outcome is:
mcv | Age | Group |
---|---|---|
100 | 62 | Case |
99 | 61 | Control |
98 | 63 | Control |
101 | 60 | Control |
87 | 64 | Control |
98 | 62 | Control |
95 | 62 | Control |
99 | 63 | Control |
97 | 60 | Control |
90 | 63 | Control |
102 | 64 | Control |
98 | 70 | Case |
90 | 69 | Control |
98 | 70 | Control |
99 | 71 | Control |
100 | 71 | Control |
98 | 72 | Control |
96 | 68 | Control |
109 | 68 | Control |
98 | 69 | Control |
90 | 70 | Control |
100 | 70 | Control |
And so on...
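The matching rule behind this layout can be sketched in base R as a greedy loop: for each case, sample up to `k` unused controls within `tol` years and stack them under the case. This is only a sketch under stated assumptions, not a definitive implementation. The function name `match_controls`, its arguments, the added `MatchID` column, and the `toy` data are all hypothetical.

```r
# Hypothetical helper: match up to `k` controls per case, using each control once
match_controls <- function(df, k = 10, tol = 2, seed = 123) {
  set.seed(seed)
  df       <- as.data.frame(df)
  cases    <- df[df$Group == "Case", ]
  controls <- df[df$Group == "Control", ]
  used     <- rep(FALSE, nrow(controls))
  out      <- vector("list", nrow(cases))

  for (i in seq_len(nrow(cases))) {
    # Unused controls within +/- tol years of this case (NA ages are dropped by which())
    idx <- which(!used & abs(controls$Age - cases$Age[i]) <= tol)
    # Sample positions in idx, so a single candidate is handled correctly
    pick <- idx[sample.int(length(idx), min(k, length(idx)))]
    used[pick] <- TRUE
    # The case row first, then its matched controls
    out[[i]] <- rbind(cases[i, ], controls[pick, ])
    out[[i]]$MatchID <- i
  }
  do.call(rbind, out)
}

# Toy data with hypothetical values
toy <- data.frame(
  Age   = c(60, 61, 62, 63, 64, 70, 90, 62, 91),
  Group = c(rep("Control", 7), "Case", "Case")
)
matched <- match_controls(toy, k = 3)
matched
```

On the real data the call would presumably be `match_controls(dat, k = 10, tol = 2)`. The loop is greedy, so the order of the cases affects which controls are left for later cases, and a case with fewer than `k` candidates still gets fewer matches.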
Does anyone know of another approach, or why these attempts don't work? I appreciate any help.