Subsets of data frame

Question

I have a data frame in R and want to create every possible subset that consists of a unique pairwise combination of two of its columns. If the original data frame has Y columns, I should therefore get Y*(Y-1)/2 subsets. The columns in each subset should keep the names they had in the original data frame. How can I do this?

Hi, welcome to SO. Since you are new here, you might want to read the [**about**](http://stackoverflow.com/about) and [**FAQ**](http://stackoverflow.com/faq) sections of the website to help you get the most out of it. Please also read [**how to make a great reproducible example**](http://stackoverflow.com/q/5963269/1478381) and update your question accordingly! Posted questions where the OP has not shown what they have already attempted and/or the desired output tend to get downvoted or closed. Just warning you for next time. — Simon O'Hanlon, Sep 01 '13 at 13:46
What function is to be applied to each pair of columns to create another column in the new dataframe? — Ferdinand.kraft, Sep 01 '13 at 14:20

score 0 · Answer 1 · answered Sep 01 '13 at 13:35

colpairs <- function(d) {
  apply(combn(ncol(d),2), 2, function(x) d[,x])
}

x <- colpairs(iris)
# lapply keeps the result as a list; sapply would collapse it into a list-matrix
lapply(x, head, n=2)

## [[1]]
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 
## [[2]]
##   Sepal.Length Petal.Length
## 1          5.1          1.4
## 2          4.9          1.4
...
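
As a quick sanity check on the approach above, the following sketch confirms that the list has Y*(Y-1)/2 elements and that each element keeps the original column names. It repeats `colpairs` so that it runs on its own; the names `subsets` and `Y` are introduced here only for illustration.

```r
# Same helper as above, repeated so the snippet is self-contained
colpairs <- function(d) {
  apply(combn(ncol(d), 2), 2, function(x) d[, x])
}

subsets <- colpairs(iris)
Y <- ncol(iris)   # number of columns in the original data frame

# The number of subsets should equal Y * (Y - 1) / 2, i.e. choose(Y, 2)
length(subsets) == Y * (Y - 1) / 2

# Each subset should carry two of the original column names
all(sapply(subsets, function(s) all(names(s) %in% names(iris))))
```

Both expressions should evaluate to `TRUE` if the subsets were built as intended.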

score 0 · Answer 2 · answered Sep 01 '13 at 13:36

I'd use combn to make the indices of your columns, and lapply to take subsets of your data.frame and store them in a list structure. e.g.

#  Example data
set.seed(1)
df <- data.frame( a = sample(2,4,replace=TRUE) ,
            b = runif(4) ,
            c = sample(letters ,4 ),
            d = sample( LETTERS , 4 ) )

# Use combn to get indices
ind <- combn( x = 1:ncol(df) , m = 2  , simplify = FALSE )

#  ind holds the column indices. With simplify = FALSE it is a list of six index pairs;
#  written out as a matrix for compactness, the pairs are (one pair per column):
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]    1    1    1    2    2    3
#[2,]    2    3    4    3    4    4

#  Make subsets, combine in list
out <- lapply( ind , function(x) df[,x] )
#[[1]]
#  a         b
#1 1 0.2016819
#2 1 0.8983897
#3 2 0.9446753
#4 2 0.6607978

#[[2]]
#  a c
#1 1 q
#2 1 b
#3 2 e
#4 2 x

#[[3]]
#  a d
#1 1 R
#2 1 J
#3 2 S
#4 2 L

#[[4]]
#          b c
#1 0.2016819 q
#2 0.8983897 b
#3 0.9446753 e
#4 0.6607978 x

#[[5]]
#          b d
#1 0.2016819 R
#2 0.8983897 J
#3 0.9446753 S
#4 0.6607978 L

#[[6]]
#  c d
#1 q R
#2 b J
#3 e S
#4 x L

You don't need `lapply`: `combn( x = 1:ncol(df) , m = 2 , FUN=function(x) df[,x], simplify = FALSE )` — Roland, Sep 01 '13 at 15:25
@Roland thanks, I always forget you can supply a function to `combn`. — Simon O'Hanlon, Sep 01 '13 at 19:21
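
Roland's suggestion can be sketched as follows, with one optional addition: naming each list element after the pair of columns it holds, which makes individual subsets easier to look up. The data frame `df` and the `"a_b"`-style element names are assumptions made for this illustration and are not part of the original answer.

```r
# Small example data frame (values chosen only for illustration)
df <- data.frame(a = 1:4,
                 b = c(0.1, 0.2, 0.3, 0.4),
                 c = letters[1:4],
                 d = LETTERS[1:4])

# combn() applies FUN to each pair of column indices;
# simplify = FALSE keeps the results as a list of data frames
out <- combn(x = seq_len(ncol(df)), m = 2,
             FUN = function(x) df[, x], simplify = FALSE)

# Optional: label each subset with the names of its two columns
names(out) <- combn(names(df), 2, FUN = paste, collapse = "_")

names(out)      # one name per column pair
out[["b_d"]]    # the subset holding columns b and d
```

The column names inside each subset are untouched either way; the list names only add a convenient handle for each pair.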
