Random subset containing at least one instance of each factor

Question

Let's define a data.framedf with 3 columns and 10 rows. The third column is the class and the two first some variables.

var1 <- rnorm(10)
var2 <- rnorm(10,2)
class<- as.factor(c(1,2,3,1,2,1,2,1,3,3))
df   <- data.frame(var1=var1,var2=var2,class=class)

How to randomly subset df in two subsets so that sub.df1 and sub.df2 have at least one instance of each class?

Do you mean partition? Meaning each row will go into one and only one subset? — flodel, Oct 22 '13 at 11:47

flodel · Accepted Answer · 2013-10-22T11:52:37.273

2

This works:

set.seed(123)
partition <- function(x, n = 2) sample(c(1:n, sample(1:n, length(x) - n, TRUE)))
split(df, as.integer(ave(df$class, df$class, FUN = partition)))

# $`1`
#          var1      var2 class
# 4   1.6674610 3.3886789     1
# 7  -0.2245588 0.8284845     2
# 8  -1.1481185 4.1586492     1
# 10 -0.4712463 3.1846324     3
# 
# $`2`
#         var1       var2 class
# 1  0.9884264  3.3487054     1
# 2 -0.1549679 -0.5815586     2
# 3  1.4484692  0.3521933     3
# 5  0.5454097  2.0405363     2
# 6  1.0971626  0.6410492     1
# 9 -1.3042283  3.3235418     3

edited Oct 22 '13 at 11:52

answered Oct 22 '13 at 11:44

flodel

87,577
21
185
223

This will fail if `length(x) - n < 0` e.g: `class<- as.factor(c(1,2,3,1,2,1,2,4,3,3))` – agstudy Oct 22 '13 at 12:10
Have you any suggestion to avoid it? – WAF Oct 22 '13 at 12:15
2

@WAF replace partition by `partition <- function(x, n = 2) {if ((length(x) - n)<0) sample(seq_along(x)) else sample(c(1:n, sample(1:n, length(x) - n, TRUE)))} ` – agstudy Oct 22 '13 at 12:28
1

@agstudy, when you say *"this will fail if..."* I'd like to add *"and it is a good thing"*. Both the OP's description and example implied at least two elements per class. And good code should always error out anytime an assumption is not met. I considered adding `stopifnot(length(x) >= n)` but figured `sample` would error out anyway. And I was ready to follow-up if that was the case. But it would not have been something wrong with my answer, rather the OP's making wrong assumptions about his data and having to modify what he wants. – flodel Oct 22 '13 at 12:54
Also *"this will fail if `df` does not have a `class` column"*, see what I mean? – flodel Oct 22 '13 at 13:17
@flodel (sacré florent :) ) I get your point! Let me just add that good code should fail "in more expressive way" anytime an assumption is not met! +1 ! – agstudy Oct 22 '13 at 13:36

score 0 · Answer 2 · answered Oct 22 '13 at 12:44

A very non elegant way could be:

#changed the data to better check
var1 <- rnorm(21)
var2 <- rnorm(21,2)
class<- as.factor(c(1,2,3,1,2,1,2,1,3,3,4,1,2,4,4,5,1,2,1,2,5))
DF   <- data.frame(var1=var1,var2=var2,class=class)

#order DF by class
DF <- DF[order(DF$class),]                  

#add a column (like a second class) 
#so that every level of the "second class" contains all levels from original class
DF$col4 <- unlist((sapply(table(DF$class), 
                         function(x) { letters[1:x] })), use.names = F)

#order by the "second class"
DF <- DF[order(DF$col4),]  

#one df with all levels of original class
DF1 <- split(DF, DF$col4)$a
#another df with all levels of orignal class
DF2 <- split(DF, DF$col4)$b

#remaining levels of second class contain levels 
#of original class already present in DF1, DF2
#so, just add them to either DF1 or DF2
DF3 <- do.call(rbind, split(DF, DF$col4)[letters[3:max(table(DF$class))]])
DF1 <- rbind(DF1, DF3[1:(nrow(DF3)/2),])
DF2 <- rbind(DF2, DF3[(nrow(DF3)/2):nrow(DF3),])

#remove the second class
DF1 <- DF1[,1:3]
DF2 <- DF2[,1:3]

#> DF1
#            var1       var2 class
#1    -0.32872359  2.0055574     1
#2    -0.93543130  1.9035439     2
#3    -0.04343290  2.9213939     3
#11    1.15724846  3.0646201     4
#16    0.44848508  0.2414504     5
#c.6   0.14438547  3.5265833     1
#c.7   0.31614781  3.7160113     2
#c.10 -0.45882460 -0.2937924     3
#c.15  0.07145533  1.8942732     4
#d.8   0.61422896  1.8690204     1
#> DF2
#            var1      var2 class
#4     0.09097849 1.4701793     1
#5     1.19147818 3.4190744     2
#9    -0.37807035 3.4565437     3
#14   -1.35257981 3.6023453     4
#21   -1.07466815 0.2104640     5
#d.8   0.61422896 1.8690204     1
#d.13 -0.26357372 1.7867625     2
#e.12  0.05694470 3.1141126     1
#e.18 -0.51124304 0.8070597     2
#f.17 -2.94353989 1.6532037     1
#f.20  0.21011089 2.4029225     2

Random subset containing at least one instance of each factor

2 Answers2

Linked