Randomly subsampling a dataframe without replacements in a specific column with R

Question

I have a dataframe with this structure:

> df
factor  y  x
1       2  0
1       3  0
1       1  0
1       2  0
2       3  0
2       1  0
2       3  1
3       4  1
3       3  1 
3       6  3
3       5  2
4       4  1
4       7  8
4       2  1
2       5  3

In the actual dataset, I have 200 rows and different variables: several continuous variables and a factor variable with 70 levels with up to 4 observations each.

I would like to randomly subsample my entire dataframe into 4 groups of equal size without replacements within each group exclusively in the factor variable. In other words, I would like to have each level of the factor variable occurring not more than once per group.

I've tried different solutions. For instance, I tried by sampling the "factor" variable into four groups without replacements as follows:

factor1 <- as.character(df$factor)

set.seed(123)
group1 <- sample(factor, 35,replace = FALSE) 

factor2 <- setdiff(factor1, group1) 
group2 <- sample(factor2, 35,replace = FALSE) 

# and the same for "group3" and "group4"

but then I don't know how to associate the group vectors (group1, group2, etc.) to the other variables in my df ('x' and 'y').

I've also tried with:

group1 <- sample_n(df, 35, replace = FALSE)

but this solution fails as well since my dataframe doesn't include duplicated rows. The only duplicated values are in the factor variable.

Finally, I tried to use the solution proposed in reply to a similar question here, adapted to my case:

random.groups <- function(n.items = 200L, n.groups = 4L,
                          factor = rep(1L, n.items)) {

  splitted.items  <- split(seq.int(n.items), factor)

  shuffled <- lapply(splitted.items, sample)

  1L + (order(unlist(shuffled)) %% n.groups)
}

df$groups <- random.groups(nrow(df), n.groups = 4)

However, the resulting 4 groups include duplicated values for the factor variable, so something is not working properly.

I would really appreciate any idea or suggestion to solve this problem!

use dplyr() : new_df <- df%>% group_by(factor) %>% sample_n(35). Please provide sample output to understand more in detail. — Dinesh Kumar, Oct 21 '21 at 21:21

Bloxx · Accepted Answer · 2021-10-21T22:02:36.593

One way is to group by factor, create a variable of factro's length, arrange by size and length. At the end, you assign a group to each first, second, third and fourth row. You can then filter out using this variable.

library(dplyr)
df <- data_frame(factor = c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,2),
                 x = floor(runif(15, min=0, max=20)),
                 y = floor(runif(15, min=211, max=305)))
df <- df %>% group_by(factor) %>% mutate(size = length(factor)) %>% arrange(desc(size), factor) %>% 
  ungroup() %>%  mutate(group = ifelse(row_number() %% 4 == 1, "A",
                                       ifelse(row_number() %% 4 == 2, "B",
                                              ifelse(row_number() %% 4 == 3, "C", "D"))))

jblood94 · Answer 2 · 2021-10-23T21:03:54.313

A data.table solution demonstrated with a slightly larger dataset:

library(data.table)

dt <- setorder(data.table(factor = sample(10, 44, TRUE), x = runif(44), y = runif(44)), factor)
numGroups <- 4L
maxFactor <- max(dt$factor)
dt2 <- setorder(
  setorder(
    dt[sample(.N, .N)], # randomly reorder the data
    factor              # sort by factor
  )[, temp := cumsum(.I > 0), by = factor] # create a column to keep a running count of the occurrence of each factor
  [temp <= numGroups]                      # remove rows that can't go in a group due to factor exclusion
  [sample(.N, .N) <= (.N %/% numGroups)*numGroups] # randomly remove excess rows (keep the group sizes equal)
  [, temp := sample(10, 10)[factor]]               # randomly reorder factor groups
  [, grp := c(rep(1:numGroups, .N/numGroups))],    # assign each row a group: row 1 -> group 1, row 2 -> group 2 ... 5 -> 1, 6 -> 2, etc.
  grp # sort by group for table readability
)[, temp := NULL] # remove the temporary column

The resulting data.table will have numGroups groups as indicated by the grp column. Each group will have same number of rows. The number of rows in each group will be the maximum possible in order to meet the no duplicate factor constraint. For a smaller sample, take a subsample of dt2 (see this question).

Randomly subsampling a dataframe without replacements in a specific column with R

2 Answers2