
In the given dataset, `case_control` indicates whether a row is a case or a control, `id` is an identifier that is unique for cases but can be repeated across controls, and `group` indicates the cluster. I need to select one control per case within each group, but if a control has already been selected for a previous case it cannot be selected for the next case (matching on the `id` variable). If there are no available controls, the case has to be dropped.

How can I make this work quickly on a very large dataset of ~10 million rows (~2 million cases and ~8 million controls)?

The dataset looks like this (https://docs.google.com/spreadsheets/d/1MpjKv9Fm_Hagb11h_dqtDX4hV7G7sZrt/edit#gid=1801722229):

group       case_control  id
cluster_1   case          11
cluster_1   control       21
cluster_1   control       22
cluster_1   control       23
cluster_2   case          12
cluster_2   control       21
cluster_2   control       22
cluster_2   control       24
cluster_3   case          13
cluster_3   control       21
cluster_3   control       22
cluster_3   control       25

The expected output must look like this:

group       case_control    id
cluster_1   case            11
cluster_1   control         21
cluster_2   case            12
cluster_2   control         22
cluster_3   case            13
cluster_3   control         25
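The matching rule above can be sketched as a simple greedy pass over the groups (an illustrative sketch only, assuming the case row comes first within each group; the names `used`, `picked`, and `result` are made up for this example):

```r
# the example data from the question
df <- data.frame(
  group        = rep(paste0("cluster_", 1:3), each = 4),
  case_control = rep(c("case", "control", "control", "control"), 3),
  id           = c(11, 21, 22, 23,  12, 21, 22, 24,  13, 21, 22, 25)
)

used <- numeric(0)                            # control ids already claimed
picked <- lapply(split(df, df$group), function(g) {
  ctrl <- g[g$case_control == "control" & !g$id %in% used, ]
  if (nrow(ctrl) == 0) return(NULL)           # no free control: drop the case
  used <<- c(used, ctrl$id[1])                # claim the first free control
  rbind(g[g$case_control == "case", ][1, ], ctrl[1, ])
})
result <- do.call(rbind, picked)
result$id[result$case_control == "control"]   # 21 22 25, as in the expected output
```

This scans `used` linearly, so it slows down as more controls are claimed; it is meant to pin down the rule, not to be the fast solution.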
Rizwan S A
  • are there always the same number of control observations per case? – Donald Seinen Dec 09 '21 at 09:41
  • Hi @DonaldSeinen, no, the number of controls can vary within each cluster. – Rizwan S A Dec 09 '21 at 09:51
  • 1
    Since speed is important (is this a recurring task?) would you mind changing the structure to suit the task? Example - an integer matrix (groups 1:n, boolean for 1/0 for case_control) would probably speed up any subsetting action here. – Donald Seinen Dec 09 '21 at 09:56
  • @DonaldSeinen, Yes, I can do that. Does it really make such a big difference to have these variable types? – Rizwan S A Dec 09 '21 at 09:57
  • 1
    [lapply vs for](https://stackoverflow.com/questions/42393658/lapply-vs-for-loop-performance-r) It is one method to avoid performance issues when the same function is applied to a larger data set, like @wimpel's answer below. Especially if an approach copies data. For your sample data a conversion to integer matrix would take ~7x less space in memory. Also, there exist many packages optimized for matrix manipulation that may be faster than other solutions because they can avoid type checks, such as `Rfast`. – Donald Seinen Dec 09 '21 at 10:07
  • @DonaldSeinen Ok, I'll try that. I haven't used `Rfast` at all. – Rizwan S A Dec 09 '21 at 10:24
  • Let's say cluster_2 only had control id 21, would you then be able to go back and reassign control id 22 to cluster_1? If so then id there a need to maximize the number of cases that aren't dropped? – Dean MacGregor Dec 10 '21 at 02:16
  • @DeanMacGregor Yes, you are right. We have to make the most of the available controls, so that we don't miss a case just because it was lower down the order. Your point is very valid. – Rizwan S A Dec 11 '21 at 03:06
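The integer-matrix restructuring suggested in the comments might look like this (a sketch of the idea only; the column names `grp` and `is_case` are my own choice):

```r
# the example data from the question
df <- data.frame(
  group        = rep(paste0("cluster_", 1:3), each = 4),
  case_control = rep(c("case", "control", "control", "control"), 3),
  id           = c(11, 21, 22, 23,  12, 21, 22, 24,  13, 21, 22, 25)
)

# Recode both character columns as integers: an all-integer matrix is much
# more compact than a data.frame of strings and cheaper to subset.
m <- cbind(grp     = as.integer(factor(df$group)),           # groups 1:n
           is_case = as.integer(df$case_control == "case"),  # 1 = case, 0 = control
           id      = as.integer(df$id))

as.numeric(object.size(m)) < as.numeric(object.size(df))   # the matrix is smaller
```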

2 Answers


Base R:

Reduce(\(x,y)rbind(x, y[which(!y$id %in% x$id)[1:2], ]), split(df[-(3:4),], ~group))

       group case_control id
1  cluster_1         case 11
2  cluster_1      control 21
5  cluster_2         case 12
7  cluster_2      control 22
9  cluster_3         case 13
12 cluster_3      control 25

Note that we just need the first case and the first non-duplicated control for each cluster, hence the `[1:2]` slice.
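One edge case worth flagging (my observation, not part of the answer): when a cluster has no unused control left, `which(...)` returns only the case row's index, `[1:2]` pads the result with `NA`, and indexing a data frame with `NA` yields an all-NA row, so an exhausted case comes through with an NA partner rather than being dropped:

```r
d <- data.frame(case_control = c("case", "control"), id = c(14, 21))
idx <- which(!d$id %in% c(21, 22))[1:2]   # only the case id is unused
idx        # 1 NA
d[idx, ]   # row 1 plus an all-NA row
```

A final `na.omit()` pass (also removing the now-unmatched case rows) may therefore be needed on real data.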

Tidyverse:

df %>%
  slice(-(3:4))%>%
  group_split(group) %>%
  reduce(~rbind(.x, slice(anti_join(.y, .x, by = c("case_control", "id")), 1:2)))

# A tibble: 6 x 3
  group     case_control    id
  <chr>     <chr>        <int>
1 cluster_1 case            11
2 cluster_1 control         21
3 cluster_2 case            12
4 cluster_2 control         22
5 cluster_3 case            13
6 cluster_3 control         25
Onyambu
  • I always seem to forget about the power of `Reduce`... nice answer!! – Wimpel Dec 09 '21 at 08:18
  • Dear @Onyambu, the base solution throws this error for me: `Error: unexpected input in "Reduce(\"`. Any idea what this might be due to? I am more interested in this solution because I think it will be the fastest for my use case. – Rizwan S A Dec 09 '21 at 09:02
  • @RizwanSA That is because you are using an older R version. Change `\(x, y)` to `function(x, y)`. – Onyambu Dec 09 '21 at 16:16
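For completeness, on R versions before 4.1 (which introduced both the `\(x, y)` lambda shorthand and the formula interface for `split()`), the same reduction can be written as:

```r
# the example data from the question
df <- data.frame(
  group        = rep(paste0("cluster_", 1:3), each = 4),
  case_control = rep(c("case", "control", "control", "control"), 3),
  id           = c(11, 21, 22, 23,  12, 21, 22, 24,  13, 21, 22, 25)
)

d <- df[-(3:4), ]   # same rows as used in the answer above
res <- Reduce(function(x, y) rbind(x, y[which(!y$id %in% x$id)[1:2], ]),
              split(d, d$group))
res
```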

Here is a data.table approach.

The code can be shortened (a lot), but I chose to keep each step separate (and commented), so you can see what actions are taken and inspect the intermediate results.

library(data.table)
#initialise vector for used ids
id.used <- as.numeric()
#split by group and loop 
L <- lapply(split(DT, by = "group"), function(x) {
  #select first row
  caserow <- x[1,]
  #select second through last row (the controls)
  controlrow <- x[2:nrow(x), ]
  #drop ids already in use
  controlrow.new <- controlrow[!id %in% id.used, ]
  #if no unused control remains, drop the case
  if (nrow(controlrow.new) == 0) return(NULL)
  #sample one random row from the remaining controls
  controlrow.sample <- controlrow.new[sample(.N, 1)]
  #record the used id (be careful with the use of <<- !! google why..)
  id.used <<- c(id.used, controlrow.sample$id)
  #rowbind the sampled row to the caserow
  return(rbind(caserow, controlrow.sample))
})
# rowbind the list back together and cast to wide
dcast(rbindlist(L), group ~ case_control, value.var = "id")
#        group case control
# 1: cluster_1   11      21
# 2: cluster_2   12      24
# 3: cluster_3   13      25

sample data used

DT <- fread("group       case_control  id
cluster_1   case          11
cluster_1   control       21
cluster_1   control       22
cluster_1   control       23
cluster_2   case          12
cluster_2   control       21
cluster_2   control       22
cluster_2   control       24
cluster_3   case          13
cluster_3   control       21
cluster_3   control       22
cluster_3   control       25")
Wimpel
  • Hi @Wimpel. This solution works perfectly. I really mustn't complain about the speed in my case, because this is as good as it gets with data.table, right? I added a little feature to your code from the `pbapply` library, where I replace the `lapply` with `pblapply` and it gives me a progress bar. Thank you so much for the comments and the step-by-step explanation. – Rizwan S A Dec 09 '21 at 09:14
  • Hi @Wimpel, is there something I can do to make it run faster? My run times are increasing over time for some reason, which I don't understand. – Rizwan S A Dec 09 '21 at 09:20
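Regarding the growing run times in the last comment: each iteration scans the ever-growing `id.used` vector with `%in%`, so later groups cost more than earlier ones. One way to keep the membership test roughly constant-time is a hashed environment. The sketch below mirrors the answer's random selection but is written in base R for brevity (the names `used` and `pick` are my own):

```r
# the example data from the question
df <- data.frame(
  group        = rep(paste0("cluster_", 1:3), each = 4),
  case_control = rep(c("case", "control", "control", "control"), 3),
  id           = c(11, 21, 22, 23,  12, 21, 22, 24,  13, 21, 22, 25)
)

used <- new.env(hash = TRUE)                 # hashed set of claimed control ids
pick <- function(g) {
  free <- g[g$case_control == "control" &
            !vapply(as.character(g$id), exists, logical(1),
                    envir = used, inherits = FALSE), ]
  if (nrow(free) == 0) return(NULL)          # no control left: drop the case
  chosen <- free[sample(nrow(free), 1), ]    # random pick, as in the answer
  assign(as.character(chosen$id), TRUE, envir = used)
  rbind(g[g$case_control == "case", ][1, ], chosen)
}
res <- do.call(rbind, lapply(split(df, df$group), pick))
```

The same `new.env(hash = TRUE)` set can be dropped into the data.table version in place of the `id.used` vector and its `%in%` test.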