I am analyzing data with the RPostgreSQL, dplyr, and foreach packages.
The full data set has 500k rows, and the target (treatment) group has 5,000 people.
I'd like to extract 4 controls per target, so the control group should contain 20,000 people in total.
The problem is that the 20,000 extracted people must not contain duplicates, but my result does contain them. In addition, a control must not be in the trt group.
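To make the two requirements concrete, here is a minimal sketch of how they could be checked on a result. The toy data frames are made up for illustration; only the names trtgroup, controlgroup, and person_id come from my code.

```r
# Toy data (hypothetical, for illustration only).
trtgroup <- data.frame(person_id = c(1, 2))
controlgroup <- data.frame(person_id   = c(10, 11, 11, 2),
                           case_number = c(1, 1, 2, 2))

# Requirement 1: nobody may appear more than once among the controls.
dup_ids <- unique(controlgroup$person_id[duplicated(controlgroup$person_id)])

# Requirement 2: no control may also be in the treatment group.
overlap_ids <- intersect(controlgroup$person_id, trtgroup$person_id)

print(dup_ids)      # person_ids selected as a control more than once
print(overlap_ids)  # person_ids that are both control and treated
```

Both vectors should be empty for a valid control group.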
My code has duplicate values. What should I do?
Thank you very much.
The code I used has the following form:
library(dplyr)
library(foreach)  # a parallel backend is assumed to be registered already

# Empty template with the columns of the final control group
controlgroup <- trtgroup[0, ] %>% select(person_id, enroll_date, measurement_date, age, value, gender, case_number)

system.time({
  controlgroup <- foreach(i = 1:50, .combine = rbind, .packages = c('dplyr', 'lubridate')) %dopar% {
    # The i-th target patient
    target_patient <- allo_case[i, ] %>% select(person_id, enroll_date, measurement_date, age, value, gender, case_number)
    # person_ids already chosen as controls (intended to prevent duplicates)
    cont_list <- controlgroup %>% pull(person_id)
    control_tmp1 <- enroll5_m %>%
      filter(gender == as.numeric(target_patient$gender),   # same gender
             age >= as.numeric(target_patient$age) - 5,
             age <= as.numeric(target_patient$age) + 5,     # age +- 5
             !(person_id %in% cont_list),                   # not already a control
             !(person_id %in% trtgroup$person_id)) %>%      # not in the trt group
      slice(1:4)
    # Take up to 4 controls and label them with the target's case number
    sample_n_number <- min(4, nrow(control_tmp1))
    control_tmp2 <- control_tmp1 %>% sample_n(sample_n_number) %>% mutate(case_number = i)
    control_tmp2
  }
})