I am new to R and am struggling to find code to age-match 10 controls per case. All the cases and controls are in one data frame and are labelled 'Case' or 'Control' in a 'Group' column. I want to make a new data frame in which each case is followed by its matched controls, where every control's age is within +/- 2 years of the case's age. My data frame has 112 cases and 4910 controls, which should be enough. Below is a small portion of my data frame; please let me know if the sample data is too small:

dat <- structure(list(Sex = c("F", "M", "F", "M", "M", "M", "F", "F", 
"F", "F", "M", "M", "M", "M", "F", "F", "M", "M", "F", "F", "F", 
"F", "M", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "F", 
"F", "M", "F", "F", "M", "F", "F", "M", "M", "M", "F", "M", "F", 
"F", "M", "F", "M", "M", "M", "M", "F", "F", "M", "F", "M", "F", 
"M", "M", "F", "F", "F", "M", "F", "F", "F", "M", "M", "F", "M", 
"M", "M", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F", "M", 
"F", "M", "M", "F", "F", "F", "M", "M", "M", "F", "F", "F", "M", 
"M", "F", "F", "F", "M", "F", "F", "M", "M", "M", "F", "F", "F", 
"M", "F"), mcv = c(89, 90, 86, 87, 90, 88, 85, 90, 92, 89, 87, 
95, 92, 94, 89, 87, 93, 90, 96, 94, 88, 101, 83, 97, 79, 91, 
92, 89, 90, 93, 88, 94, 92, 89, 97, 98, 80, 92, 87, 95, 85, 91, 
89, 89, 94, 77, 92, 92, 82, 92, 85, 105, 96, 102, 89, 87, 87, 
95, 93, 88, 93, 82, 88, 86, 87, 88, 89, 89, 91, 90, 90, 85, 95, 
88, 91, 88, 87, 92, 91, 92, 92, 80, 80, 96, 85, 90, 88, 89, 86, 
91, 91, 76, 94, 86, 94, 84, 88, 92, 101, 91, 93, 98, 98, 91, 
86, 84, 91, 90, 88, 88, 83, 91, NA, 101), Age = c(52, 63, 72, 
52, 66, 59, 51, 63, 68, 53, 64, 70, 70, 78, 59, 55, 54, 54, 83, 
61, 51, 72, 57, 67, 72, 52, 55, 52, 95, 79, 60, 61, 73, 69, 65, 
55, 53, 77, 79, 54, 64, 54, 65, 71, 63, 52, 54, 63, 69, 70, 56, 
80, 54, 67, 59, 71, 56, 73, 53, 61, 71, 73, 74, 63, 82, 60, 52, 
65, 75, 66, 74, 71, 58, 52, 53, 55, 91, 73, 62, 51, 74, 73, 64, 
60, 58, 63, 63, 59, 72, 52, 85, 51, 61, 56, 60, 64, 73, 78, 57, 
52, 62, 64, 70, 62, 58, 69, 84, 72, 71, 63, 73, 63.3, 62.3, 59.56
), Group = c("Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Control", "Control", 
"Control", "Control", "Control", "Control", "Case", "Case", "Case"
)), row.names = c(NA, -114L), class = c("tbl_df", "tbl", "data.frame"
))

I have tried code from other questions, but it doesn't work:

Matching controls to cases using multiple conditions in r

library(dplyr, warn.conflicts = FALSE)

dat %>%
  split(.$group) %>%
  list2env(envir = .GlobalEnv)

control$FILTER <- FALSE
control

set.seed(123)

for(i in seq_len(nrow(case))){
  x <- which(between(control$age, case$age[i] -2, case$age[i] +2) & 
               !control$FILTER)
  control$FILTER[sample(x, min(10, length(x)))] <- TRUE
}

control

bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)

The result of the code above had 30 controls missing.
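One possible reason for a shortfall like this is that the loop draws controls without replacement: a case only receives 10 controls if at least 10 unused controls fall inside its +/- 2 year window, and cases with overlapping windows compete for the same pool. The sketch below is a diagnostic, not a fix. It uses hypothetical toy data (not the real data frame) and the column names `Age` and `Group` from the sample above to count how many controls are eligible for each case.

```r
# Hypothetical toy data: 3 cases (ages 60, 61, 90) and 12 controls (ages 58 to 69).
dat_toy <- data.frame(
  Age   = c(60, 61, 90, 58:69),
  Group = c(rep("Case", 3), rep("Control", 12))
)

case_toy    <- dat_toy[dat_toy$Group == "Case", ]
control_toy <- dat_toy[dat_toy$Group == "Control", ]

# For each case, count the controls whose age lies within +/- 2 years.
case_toy$n_eligible <- vapply(
  case_toy$Age,
  function(a) sum(abs(control_toy$Age - a) <= 2, na.rm = TRUE),
  integer(1)
)

# Cases with n_eligible below the target cannot receive a full set of controls,
# even before competition between overlapping windows is taken into account.
case_toy
```

In the toy data the case aged 90 has no eligible controls at all, and the cases aged 60 and 61 share most of their candidates, so neither situation can be solved by resampling alone.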

library(purrr)  # provides map_df()

case_data <- dat %>% filter(group == 'case')
control_data <- dat %>% filter(group == 'control')

# For each case, bind the case row to a sample of controls within +/- 2 years
case_data %>%
  group_split(row_number(), .keep = FALSE) %>%
  map_df(~ bind_rows(.x, control_data %>%
                       filter(between(age, .x$age - 2, .x$age + 2)) %>%
                       slice_sample(n = 10)))

The code above produced an error:

Error in `slice_sample()`:
! Problem while computing indices.
Caused by error in `sample.int()`:
! invalid first argument

My expected outcome is:

mcv Age Group
100 62 Case
99 61 Control
98 63 Control
101 60 Control
87 64 Control
98 62 Control
95 62 Control
99 63 Control
97 60 Control
90 63 Control
102 64 Control
98 70 Case
90 69 Control
98 70 Control
99 71 Control
100 71 Control
98 72 Control
96 68 Control
109 68 Control
98 69 Control
90 70 Control
100 70 Control

And so on...

Does anyone know another approach, or why these don't work? I would appreciate any help.

Halloumi
  • Hi, welcome to SO! To help us help you, can you provide a minimal working example of your data? You can use the dput() function in R to get started. Currently, we do not know what the `dat` object is, so we are unable to help. Providing a data example makes it more likely that your question will be answered promptly. Example of dput: https://stackoverflow.com/questions/49994249/example-of-using-dput. Also a link on reproducible examples: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – bs93 Jan 15 '23 at 03:37
  • I'm assuming the data is the sample in the linked question. I can tell you that when you remove `slice_sample()` you will see why you get that error. If there aren't 10 observations within 2 years it will just error out. The third case has no controls within 2 years, for example. (The age is 44.) Additionally, there is very little sample data to see what other issues could arise here. – Kat Jan 15 '23 at 18:40
  • Hello, thank you for the comments. I edited the original post; please let me know if the data example is wrong. – Halloumi Jan 16 '23 at 03:27
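
For reference, the matching described in the question (up to 10 controls per case, within +/- 2 years, each control used at most once) can be sketched in base R as below. This is one possible greedy approach under stated assumptions, not a definitive method: `match_controls()`, `match_id`, `n_controls`, and `caliper` are hypothetical names introduced here, the columns are assumed to be called `Age` and `Group` as in the sample data, and a tibble may need `as.data.frame()` first. Cases are processed in row order, so earlier cases get first pick of shared controls.

```r
# Greedy age matching without replacement (hypothetical helper, base R only).
match_controls <- function(dat, n_controls = 10, caliper = 2, seed = 123) {
  set.seed(seed)
  cases    <- dat[dat$Group == "Case", , drop = FALSE]
  controls <- dat[dat$Group == "Control", , drop = FALSE]
  used     <- rep(FALSE, nrow(controls))
  out      <- vector("list", nrow(cases))

  for (i in seq_len(nrow(cases))) {
    # Unused controls within the caliper; which() silently drops NA ages.
    idx    <- which(abs(controls$Age - cases$Age[i]) <= caliper & !used)
    n_take <- min(n_controls, length(idx))
    # Sample positions with sample.int(): sample(idx, ...) would treat a
    # single candidate index k as the range 1:k and could pick a wrong row.
    pick <- if (n_take > 0) idx[sample.int(length(idx), n_take)] else integer(0)
    used[pick] <- TRUE

    matched          <- rbind(cases[i, , drop = FALSE], controls[pick, , drop = FALSE])
    matched$match_id <- i  # ties each control to its case
    out[[i]]         <- matched
  }
  do.call(rbind, out)
}

# Usage on hypothetical toy data, asking for 3 controls per case.
toy <- data.frame(
  Age   = c(60, 61, 90, 58:69),
  Group = c(rep("Case", 3), rep("Control", 12))
)
res <- match_controls(toy, n_controls = 3)

# Number of controls actually matched to each case; values below the target
# show where the pool ran out.
table(res$match_id[res$Group == "Control"])
```

A case with too few eligible controls simply receives fewer rows rather than raising an error, so the final tabulation is worth checking before any analysis.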

0 Answers