I am analyzing data using the RPostgreSQL, dplyr, and foreach packages.

The full dataset has 500k rows, and the target (treatment) group has 5,000 people.

I'd like to extract 4 controls per target, so the control group will have 20,000 people in total.

The problem is that the 20,000 extracted people should contain no duplicates, but my result does. In addition, a control must not be in the treatment group (trtgroup).

My code produces duplicate values. What should I do?

Thank you very much


The code I used looks like this:

controlgroup <- trtgroup[1,] %>% select(person_id,enroll_date,measurement_date,age,value,gender,case_number) %>% .[0,]


system.time({
  controlgroup <- foreach(i=1:50, .combine=rbind, .packages=c('dplyr','lubridate')) %dopar% {
    target_patient <- allo_case[i,] %>% select(person_id,enroll_date,measurement_date,age,value,gender,case_number)
    cont_list <- controlgroup %>% select(person_id)
    control_tmp1 <- enroll5_m %>% filter((gender == target_patient[6] %>% as.numeric) &  # gender
                                                (age >= target_patient[4] %>% as.numeric - 5 & age <= target_patient[4] %>% as.numeric + 5) & # age +- 5
                                                (!(person_id %in% (cont_list %>% select(person_id)))) &
                                                (!(person_id %in% (trtgroup %>% select(person_id))))) %>% slice(1:4)
    sample_n_number <- if_else(nrow(control_tmp1) >= 4, 4, nrow(control_tmp1) %>% as.double())
    control_tmp2 <- control_tmp1 %>% sample_n(sample_n_number) %>% mutate(case_number = i)
    return(control_tmp2)
  }
})

ys y

1 Answer

One approach could be to create a randomised vector of distinct row indices (sampled without replacement), and then split it into a list of 5000 vectors, each containing 4 indices. Because no index is repeated, no row can end up in two control groups. You then select your control groups based on this list.

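# Random permutation of 1..20000; replace = FALSE means no index repeats
d <- sample.int(20000, 20000, replace = FALSE)
# Split into consecutive chunks of 4 indices, giving 5000 groups
split(d, ceiling(seq_along(d)/4))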
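To apply this to the setup in the question, the indices would probably be drawn from the pool of eligible controls rather than from 1:20000. A minimal sketch in base R, assuming a hypothetical data frame `pool` with a `person_id` column and a hypothetical vector `trt_ids` of treatment-group IDs (neither name comes from the question, and the data are simulated):

```r
set.seed(42)

# Simulated stand-ins (assumptions, not the real data)
pool    <- data.frame(person_id = 1:1000)
trt_ids <- sample(pool$person_id, 50)   # hypothetical treatment-group IDs
n_trt   <- length(trt_ids)

# Drop treated people so a control can never come from the treatment group
eligible <- pool[!(pool$person_id %in% trt_ids), , drop = FALSE]

# Draw 4 * n_trt distinct row indices; replace = FALSE rules out duplicates
idx <- sample.int(nrow(eligible), 4 * n_trt, replace = FALSE)

# Split the indices into n_trt groups of 4, one group per treated person
groups <- split(idx, ceiling(seq_along(idx) / 4))

# Look up the rows for each group and tag them with the case number
controls <- do.call(rbind, lapply(seq_along(groups), function(k) {
  cbind(eligible[groups[[k]], , drop = FALSE], case_number = k)
}))
```

This yields random, non-overlapping controls, but it does not take the gender and age conditions into account.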

Example, splitting mtcars into 8 non-overlapping random groups:

set.seed(1724)
df <- mtcars
d <- sample.int(32, 32, replace = FALSE)
i <- split(d, ceiling(seq_along(d)/4))
control <- lapply(i, function(x, df) df[x,], df = df)

Alternatively, purrr::map() might scale better:

control <- purrr::map(i, ~ df[.x, ])

See https://stackoverflow.com/a/17773112/8675075 and https://stackoverflow.com/a/3321659/8675075
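Random splitting on its own does not enforce the gender and age ±5 matching from the question. One way to keep both the matching and the no-duplicate rule is a plain sequential loop that removes each chosen control from the pool before the next target is processed. The iterations of a `%dopar%` loop run independently and only combine their results at the end, so one iteration cannot see what another has picked. A sketch on simulated data, with column names borrowed from the question and everything else hypothetical:

```r
set.seed(1)

# Simulated data (assumption): 2000 people, the first 20 are treated
people <- data.frame(
  person_id = 1:2000,
  gender    = sample(1:2, 2000, replace = TRUE),
  age       = sample(20:80, 2000, replace = TRUE)
)
trtgroup <- people[1:20, ]

# Start from everyone who is not in the treatment group
pool <- people[!(people$person_id %in% trtgroup$person_id), ]

matched <- vector("list", nrow(trtgroup))
for (i in seq_len(nrow(trtgroup))) {
  target <- trtgroup[i, ]

  # Candidates: same gender, age within 5 years, not yet used
  cand <- pool[pool$gender == target$gender &
               abs(pool$age - target$age) <= 5, ]

  # Take up to 4 candidates at random (fewer if not enough are left)
  pick <- cand[sample.int(nrow(cand), min(4, nrow(cand))), ]
  pick$case_number <- rep(i, nrow(pick))
  matched[[i]] <- pick

  # Remove the chosen people so they cannot be picked again
  pool <- pool[!(pool$person_id %in% pick$person_id), ]
}

controlgroup <- do.call(rbind, matched)
```

Since the pool shrinks after every target, `anyDuplicated(controlgroup$person_id)` should return 0. The order in which targets are processed affects who gets matched, so this is a greedy match rather than an optimal one.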

Paul