Matching controls to cases using multiple conditions

Question

I want to match 2 controls for every case with two conditions:

the age difference should between ±2;
the income difference should between ±2.

If there are more than 2 controls for a case, I just need select 2 controls randomly. There is an example:

EXAMPLE

DATA

dat = structure(list(id = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 
                     777, 888, 999, 1000), 
              age = c(10, 20, 44, 11, 12, 11, 8, 12,  11, 22, 21, 18, 21, 18), 
              income = c(35, 72, 11, 35, 37, 36, 33,  70, 34, 74, 70, 44, 76, 70), 
              group = c("case", "case", "case", "case", "control", "control", 
                        "control", "control", "control", "control", "control", 
                        "control", "control", "control")), 
         row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))

> dat
# A tibble: 14 x 4
      id   age income group  
   <dbl> <dbl>  <dbl> <chr>  
 1     1    10     35 case   
 2     2    20     72 case   
 3     3    44     11 case   
 4     4    11     35 case   
 5   111    12     37 control
 6   222    11     36 control
 7   333     8     33 control
 8   444    12     70 control
 9   555    11     34 control
10   666    22     74 control
11   777    21     70 control
12   888    18     44 control
13   999    21     76 control
14  1000    18     70 control

EXPECT OUTCOME

For id = 1, the matched controls as below, and I just need select 2 controls randomly in the table below.

id	age	income	group
111	12	37	control
222	11	36	control
333	8	33	control
555	11	34	control

For id = 2,the matched controls as below, and I just need select 2 controls randomly in the table below.

id	age	income	group
666	22	74	control
777	21	70	control
1000	18	70	control

For id = 3,there is no matched controls in dat.

For id = 4, the matched controls as below, and I just need select 2 controls randomly in the table below.

One thing to note here is that we can find that the controls for id = 1 and id = 4 have overlapping parts. I don't want two cases to share a control, what I need is that if id = 1 chooses id = 111 and id = 222 as control, then id = 4 can only choose id = 555 as control, and if id = 1 chooses id = 111 and id = 333 as control, then id = 4 can only choose id = 222 and id = 555 as controls.

id	age	income	group
111	12	37	control
222	11	36	control
555	11	34	control

The final output maybe like this(the id in control group is randomly selected from the id that meets the conditions):

id	age	income	group
1	10	35	case
2	20	72	case
3	44	11	case
4	11	35	case
111	12	37	control
222	11	36	control
333	8	33	control
555	11	34	control
777	21	70	control
1000	18	70	control

NOTE

I've looked up some websites, but they don't meet my needs. I don't know how to implement my requirements using R code.

Any help will be highly appreciated!

Reference:

1.https://stackoverflow.com/questions/56026700/is-there-any-package-for-case-control-matching-individual-1n-matching-in-r-n

2.Case control matching in R (or spss), based on age, sex and ethnicity?

3.Matching case-controls in R using the ccoptimalmatch package

4.Exact Matching in R

@AnilGoyal, I update `dat`. Yes, I want random selection in final output also. — zhiwei li, Jun 26 '21 at 13:08
@ AnilGoyal, In my real situation, the `control` choices overlap, so I've updated the example. — zhiwei li, Jun 26 '21 at 15:08
Overlap part I understand, but why 6 controls in output? Do you mean 2 per I'd? — AnilGoyal, Jun 26 '21 at 15:13

AnilGoyal · Accepted Answer · 2021-06-27T05:29:12.433

As per modified requirement, I propose the following for loop

library(dplyr, warn.conflicts = F)

dat %>%
  split(.$group) %>%
  list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>

control$FILTER <- FALSE
control
#> # A tibble: 10 x 5
#>       id   age income group   FILTER
#>    <dbl> <dbl>  <dbl> <chr>   <lgl> 
#>  1   111    12     37 control FALSE 
#>  2   222    11     36 control FALSE 
#>  3   333     8     33 control FALSE 
#>  4   444    12     70 control FALSE 
#>  5   555    11     34 control FALSE 
#>  6   666    22     74 control FALSE 
#>  7   777    21     70 control FALSE 
#>  8   888    18     44 control FALSE 
#>  9   999    21     76 control FALSE 
#> 10  1000    18     70 control FALSE

set.seed(123)

for(i in seq_len(nrow(case))){
  x <- which(between(control$age, case$age[i] -2, case$age[i] +2) & 
               between(control$income, case$income[i] -2, case$income[i] + 2) & 
               !control$FILTER)
  control$FILTER[sample(x, min(2, length(x)))] <- TRUE
}

control
#> # A tibble: 10 x 5
#>       id   age income group   FILTER
#>    <dbl> <dbl>  <dbl> <chr>   <lgl> 
#>  1   111    12     37 control TRUE  
#>  2   222    11     36 control TRUE  
#>  3   333     8     33 control TRUE  
#>  4   444    12     70 control FALSE 
#>  5   555    11     34 control TRUE  
#>  6   666    22     74 control FALSE 
#>  7   777    21     70 control TRUE  
#>  8   888    18     44 control FALSE 
#>  9   999    21     76 control FALSE 
#> 10  1000    18     70 control TRUE

bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
#> # A tibble: 10 x 4
#>       id   age income group  
#>    <dbl> <dbl>  <dbl> <chr>  
#>  1     1    10     35 case   
#>  2     2    20     72 case   
#>  3     3    44     11 case   
#>  4     4    11     35 case   
#>  5   111    12     37 control
#>  6   222    11     36 control
#>  7   333     8     33 control
#>  8   555    11     34 control
#>  9   777    21     70 control
#> 10  1000    18     70 control

Check results for a different seed

set.seed(234)
for(i in seq_len(nrow(case))){
  x <- which(between(control$age, case$age[i] -2, case$age[i] +2) & 
               between(control$income, case$income[i] -2, case$income[i] + 2) & 
               !control$FILTER)
  control$FILTER[sample(x, min(2, length(x)))] <- TRUE
}
control

bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)

# A tibble: 10 x 4
      id   age income group  
   <dbl> <dbl>  <dbl> <chr>  
 1     1    10     35 case   
 2     2    20     72 case   
 3     3    44     11 case   
 4     4    11     35 case   
 5   111    12     37 control
 6   222    11     36 control
 7   333     8     33 control
 8   555    11     34 control
 9   777    21     70 control
10  1000    18     70 control

dat modified before proceeding for id 3

split the data into two groups case and control using baseR's `split
save two as separate dfs using list2env
using purrr::map_df you can take sample 2 rows for each case
- once for age
- and once for income
finally sample 2 rows again from each of these results
bind_rows again these with case also

library(tidyverse)

dat = structure(list(id = c(1, 2, 3, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000), 
                     age = c(10, 20, 44, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18), 
                     income = c(35, 72, 11, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70), 
                     group = c("case", "case", "case", "control", "control", "control", 
                               "control", "control", "control", "control", "control", 
                               "control", "control")),
                row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"))

dat
#> # A tibble: 13 x 4
#>       id   age income group  
#>    <dbl> <dbl>  <dbl> <chr>  
#>  1     1    10     35 case   
#>  2     2    20     72 case   
#>  3     3    44     11 case   
#>  4   111    12     37 control
#>  5   222    11     36 control
#>  6   333     8     33 control
#>  7   444    12     70 control
#>  8   555    11     34 control
#>  9   666    22     74 control
#> 10   777    21     70 control
#> 11   888    18     44 control
#> 12   999    21     76 control
#> 13  1000    18     70 control

dat %>%
  split(.$group) %>%
  list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>

set.seed(123)
bind_rows(case, map_dfr(case$age, ~ control %>% filter(between(age, .x -2, .x +2) ) %>%
       sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
       map_dfr(case$income, ~ control %>% filter(between(income, .x -2, .x +2)) %>%
                 sample_n(min(n(),2))) %>% sample_n(min(n(),2)))
#> # A tibble: 7 x 4
#>      id   age income group  
#>   <dbl> <dbl>  <dbl> <chr>  
#> 1     1    10     35 case   
#> 2     2    20     72 case   
#> 3     3    44     11 case   
#> 4   222    11     36 control
#> 5   777    21     70 control
#> 6   111    12     37 control
#> 7   333     8     33 control

the below code will also do the same without saving individual dfs

dat %>%
  split(.$group) %>%
  {bind_rows(.$case, 
             map_dfr(.$case$age, \(.x) .$control %>% filter(between(age, .x -2, .x +2) ) %>%
                       sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
             map_dfr(.$case$income, \(.x) .$control %>% filter(between(income, .x -2, .x +2)) %>%
                       sample_n(min(n(),2))) %>% sample_n(min(n(),2)))}

Dear @ AnilGoyal , sorry to bother you again. I have another problem. How do I generate a new variable that indicates the control that each case matches? For example, `Control1` and `Control2` matched by `Case1` are encoded as `group 1`, and `Control1` and `Control2` matched by `Case2` are encoded as `group 2`. I would be very grateful if you could help me. — zhiwei li, Mar 08 '22 at 15:13
@zhiweili, please post the link of question or if not posted as separate question already, please do so. — AnilGoyal, Mar 08 '22 at 16:31
the new question have posted: https://stackoverflow.com/questions/71402703/how-to-obtain-the-specific-grouping-of-cases-and-controls-in-r — zhiwei li, Mar 08 '22 at 23:36

score 3 · Answer 2 · answered Jun 26 '21 at 09:59

Separate case and control in different dataframes. For each row in case_data find the matching rows in control_data and select 2 random rows from it.

Using map_df we can combine everything in one dataframe.

library(dplyr)
library(purrr)

case_data <- dat %>% filter(group == 'case')
control_data <- dat %>% filter(group == 'control')

case_data %>%
  group_split(row_number(), .keep = FALSE) %>%
  map_df(~bind_rows(.x, control_data %>% 
                    filter(between(age, .x$age - 2, .x$age + 2), 
                           between(income, .x$income - 2, .x$income + 2)) %>%
        slice_sample(n = 2)))

#     id   age income group  
#  <dbl> <dbl>  <dbl> <chr>  
#1     1    10     35 case   
#2   333     8     33 control
#3   111    12     37 control
#4     2    20     72 case   
#5   666    22     74 control
#6   777    21     70 control

In my real situation, the `control` choices overlap, so I've updated the example. — zhiwei li, Jun 26 '21 at 15:08

score 3 · Answer 3 · answered Jun 26 '21 at 10:44

You could also the following solution. I wrapped map2 function inside curly braces so that I could choose what variables I would like to use for .x and .y, otherwise %>% would've replace the whole data set as the first argument:

library(dplyr)
library(purrr)

dat %>%
  filter(group == "case") %>%
  group_by(id) %>%
  {map2(.$age, .$income, ~ dat %>% 
          filter(group == "control" & age >= .x - 2 & age <= .x + 2 & 
                   income >= .y - 2 & income <= .y + 2))} %>%
  map_dfr(~ .x %>% 
            slice_sample(n = 2)) %>%
  bind_rows(dat %>% 
              filter(group == "case")) %>%
  arrange(id)

# A tibble: 7 x 4
     id   age income group  
  <dbl> <dbl>  <dbl> <chr>  
1     1    10     35 case   
2     2    20     72 case   
3     3    44     11 case   
4   222    11     36 control
5   333     8     33 control
6   777    21     70 control
7  1000    18     70 control

In my real situation, the `control` choices overlap, so I've updated the example. — zhiwei li, Jun 26 '21 at 15:08