Imagine a data frame `df` in the following format:
```
   ID1 ID2
1    A   1
2    A   2
3    A   3
4    A   4
5    A   5
6    B   1
7    B   2
8    B   3
9    B   4
10   B   5
11   C   1
12   C   2
13   C   3
14   C   4
15   C   5
```
The problem is to randomly select one row (ideally adjustable to n rows) for the first unique value of ID1, remove the selected ID2 value(s) from the dataset, then randomly select from the remaining pool of ID2 values for the second unique ID1 value, and so on, recursively.
So, for example, for the first ID1 value it would do `sample(1:5, 1)`, with the result 2. For the second ID1 value, it would do `sample(c(1, 3:5), 1)`, with the result 3. For the third ID1 value, it would do `sample(c(1, 4:5), 1)`, with the result 5. It can never happen that no unique ID2 value is left to assign to a particular ID1. However, when selecting multiple ID2 values (e.g. three), there may not be enough of them left; in that case, select as many as possible. In the end, the result should have a similar format:
```
  ID1 ID2
1   A   2
2   B   3
3   C   5
```
It should be efficient enough to handle reasonably large datasets (tens of thousands of unique values in ID1 and hundreds of thousands of ID2 values per ID1 value).
I tried multiple ways to solve this, but honestly none of them are meaningful and they would likely only add confusion, so I'm not sharing them here.
Sample data:
```r
df <- data.frame(ID1 = rep(LETTERS[1:3], each = 5),
                 ID2 = rep(1:5, 3))
```
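For illustration, here is a minimal base-R sketch of the greedy loop described above (the function name `sample_without_global_replacement` and the use of `split()` are my own choices, not something required by the problem); it walks the ID1 groups in order and samples from each group's pool minus everything already taken:

```r
# Greedy sequential sampling: for each ID1 group (in order), sample up to n
# ID2 values from that group's pool, excluding values already assigned to
# earlier groups; if fewer than n remain, take as many as possible.
sample_without_global_replacement <- function(df, n = 1) {
  taken <- integer(0)  # ID2 values already assigned to earlier ID1 groups
  out <- lapply(split(df$ID2, df$ID1), function(pool) {
    avail <- setdiff(pool, taken)
    k <- min(n, length(avail))
    # index into avail via sample.int() to avoid the sample(x, ...) pitfall
    # when avail has length 1 (sample(5, 1) would sample from 1:5)
    picked <- avail[sample.int(length(avail), k)]
    taken <<- c(taken, picked)
    picked
  })
  data.frame(ID1 = rep(names(out), lengths(out)),
             ID2 = unlist(out, use.names = FALSE))
}

df <- data.frame(ID1 = rep(LETTERS[1:3], each = 5),
                 ID2 = rep(1:5, 3))
set.seed(1)
sample_without_global_replacement(df, n = 1)
```

With `n = 1` on the sample data this returns one row per ID1 with three distinct ID2 values; with `n = 2` it returns 2 + 2 + 1 = 5 rows, since only one ID2 value remains for C. This is a readability-first sketch; for the dataset sizes mentioned above, `setdiff()`/`c()` in a loop may need to be replaced with something faster (e.g. a logical mask over the ID2 universe).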