I have a dataframe with this structure:
> df
factor y x
1 2 0
1 3 0
1 1 0
1 2 0
2 3 0
2 1 0
2 3 1
3 4 1
3 3 1
3 6 3
3 5 2
4 4 1
4 7 8
4 2 1
2 5 3
In the actual dataset, I have 200 rows and more variables: several continuous variables and a factor variable with 70 levels, each with up to 4 observations.
I would like to randomly split my entire dataframe into 4 groups of equal size, sampling without replacement with respect to the factor variable only. In other words, each level of the factor variable should occur at most once per group.
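To make the requirement concrete, here is a minimal sketch of the condition a valid grouping should satisfy. It uses only the factor column of the toy data above; `is_valid_grouping`, `good` and `bad` are hypothetical names made up for illustration, and the two assignments are written by hand rather than sampled.

```r
# Factor column of the toy data above (15 rows)
f <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 2)

# Hypothetical helper: TRUE if no factor level occurs more than once in any group
is_valid_grouping <- function(factor, groups) {
  !any(duplicated(data.frame(factor = factor, groups = groups)))
}

# A hand-made assignment in which every level appears at most once per group
good <- c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4)
# The same assignment, except that level 1 now appears twice in group 1
bad  <- c(1, 1, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4)

is_valid_grouping(f, good)  # TRUE
is_valid_grouping(f, bad)   # FALSE
```

With 15 rows the groups cannot be exactly equal in size (here 4, 4, 4 and 3), but in the real data 200 rows would give 4 groups of 50.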
I've tried different solutions. For instance, I tried sampling the "factor" variable into four groups without replacement as follows:
factor1 <- as.character(df$factor)
set.seed(123)
group1 <- sample(factor1, 35, replace = FALSE)
factor2 <- setdiff(factor1, group1)
group2 <- sample(factor2, 35, replace = FALSE)
# and the same for "group3" and "group4"
but then I don't know how to match the group vectors (group1, group2, etc.) with the other variables in my df ('x' and 'y').
I've also tried sample_n() from dplyr:
group1 <- sample_n(df, 35, replace = FALSE)
but this fails as well, since my dataframe has no duplicated rows; the only duplicated values are in the factor variable.
Finally, I tried to use the solution proposed in reply to a similar question here, adapted to my case:
random.groups <- function(n.items = 200L, n.groups = 4L,
                          factor = rep(1L, n.items)) {
  splitted.items <- split(seq.int(n.items), factor)
  shuffled <- lapply(splitted.items, sample)
  1L + (order(unlist(shuffled)) %% n.groups)
}
df$groups <- random.groups(nrow(df), n.groups = 4)
However, the resulting 4 groups still contain duplicated levels of the factor variable, so something is not working properly.
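The duplicates can be seen by cross-tabulating group against factor level. Below is a minimal reproducible sketch on the toy data, with the function and the call copied unchanged from above; the `set.seed(123)` line and the `tab` object are additions for illustration only.

```r
# Toy data from the top of the question
df <- data.frame(
  factor = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 2),
  y      = c(2, 3, 1, 2, 3, 1, 3, 4, 3, 6, 5, 4, 7, 2, 5),
  x      = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 3, 2, 1, 8, 1, 3)
)

# Function and call exactly as used above
random.groups <- function(n.items = 200L, n.groups = 4L,
                          factor = rep(1L, n.items)) {
  splitted.items <- split(seq.int(n.items), factor)
  shuffled <- lapply(splitted.items, sample)
  1L + (order(unlist(shuffled)) %% n.groups)
}

set.seed(123)  # only to make the run repeatable
df$groups <- random.groups(nrow(df), n.groups = 4)

# Rows are groups, columns are factor levels;
# any count above 1 means a level occurs more than once in that group
tab <- table(group = df$groups, factor = df$factor)
tab
which(tab > 1, arr.ind = TRUE)
```

What I am after is a grouping for which the last line returns no rows.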
I would really appreciate any idea or suggestion to solve this problem!