
I'd like to pair up the columns of a data frame so that, within each row, every pair of column values becomes a row of its own. As an example, df1 should be transformed to df2.

df1

col1 col2 col3
1    2    3   
0    0    1

df2

c1  c2
1    2
1    3
2    3
0    0
0    1
0    1

The solution should be scalable for df1s with (way) more than three columns.

I thought about melt/reshape/dcast but found no solution yet. There are no NAs in the data frame. Thank you!

EDIT: reshape just produced errors, so I thought about

comb1 <- combn(df1[1, ], 2)
comb2 <- t(comb1)

and looping over all rows, appending the result each time. That is inefficient, though, considering the 2 million rows.
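
To make the idea concrete, here is a rough base-R sketch of that per-row approach; it reproduces df2 for the toy example, but growing the result one block of rows at a time like this will not scale to millions of rows:

df1 <- data.frame(col1 = c(1, 0), col2 = c(2, 0), col3 = c(3, 1))

# all pairs of values within each row, stacked into one data frame
df2 <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  as.data.frame(t(combn(unlist(df1[i, ]), 2)))
}))
names(df2) <- c("c1", "c2")
df2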

user5835099

2 Answers


Here's the approach I would take.

Create a function that uses rbindlist from "data.table" and combn from base R. The function looks like this:

library(data.table)

lengthener <- function(indf) {
  # stack every 2-column combination; idcol marks which combination each row came from
  temp <- rbindlist(
    combn(names(indf), 2, FUN = function(x) indf[x], simplify = FALSE),
    use.names = FALSE, idcol = TRUE)
  # turn .id into the original row number, order by it, then drop it
  setorder(temp[, .id := sequence(.N), by = .id], .id)[, .id := NULL][]
}

Here's the sample data from the other answer, and the application of the function on it:

df1 = as.data.frame(matrix(c(1,2,3,4,0,0,1,1), byrow = TRUE, nrow = 2))

lengthener(df1)
#     V1 V2
#  1:  1  2
#  2:  1  3
#  3:  1  4
#  4:  2  3
#  5:  2  4
#  6:  3  4
#  7:  0  0
#  8:  0  1
#  9:  0  1
# 10:  0  1
# 11:  0  1
# 12:  1  1

Test it out on some larger data too:

set.seed(1)
M <- as.data.frame(matrix(sample(100, 100*100, TRUE), 100))
system.time(out <- lengthener(M))
#    user  system elapsed 
#    0.19    0.00    0.19 
out
#         V1 V2
#      1: 27 66
#      2: 27 27
#      3: 27 68
#      4: 27 66
#      5: 27 56
#     ---      
# 494996: 33 13
# 494997: 33 66
# 494998: 80 13
# 494999: 80 66
# 495000: 13 66

System time for the other approach:

funAMK <- function(indf) {
  nrow_combn = nrow(t(combn(indf[1,], m = 2)))
  nrow_df = nrow(indf) * nrow_combn
  df2 = data.frame(V1 = rep(0, nrow_df), V2 = rep(0, nrow_df))
  for(i in 1:nrow(indf)){
    df2[(((i-1)*nrow_combn)+1):(i*(nrow_combn)), ] = data.frame(t(combn(indf[i,], m = 2)))
  }
  df2
}

system.time(funAMK(M))
#    user  system elapsed 
#   16.03    0.16   16.37 
A5C1D2H2I1M1N2O1R2T1
  • Thanks, @Ananda Mahto! I've already been working successfully with the other answer above but appreciate your effort! – user5835099 Mar 21 '16 at 03:38
  • @user5835099, no problem. Just FYI, I've added this [as a function](https://github.com/mrdwab/SOfun/blob/master/R/lengthener.R) in [my SOfun package](https://github.com/mrdwab/SOfun). The `lengthener()` function lets you specify the "n" for the `combn` as well. My answer was mostly to show how some more efficiency could be achieved if needed and because Andy wasn't able to get the code working with "data.table". – A5C1D2H2I1M1N2O1R2T1 Mar 21 '16 at 03:41
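
For reference, a minimal sketch of what such a generalized function might look like (the function name and `n` argument here are illustrative and may differ from the actual SOfun version):

lengthener_n <- function(indf, n = 2) {
  # same idea as above, but the size of each combination is a parameter
  temp <- rbindlist(
    combn(names(indf), n, FUN = function(x) indf[x], simplify = FALSE),
    use.names = FALSE, idcol = TRUE)
  setorder(temp[, .id := sequence(.N), by = .id], .id)[, .id := NULL][]
}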

Your edit is very similar to my answer below; you just need to rbind the result on each iteration over the rows of df1. Using data.table is a good way to speed up rbind; see this answer for more.

EDIT: Unfortunately, when I switched to the data.table approach, it turned out that rbindlist() produced the wrong answer (as pointed out in the comment below). Therefore, although it may be slightly slower, I think that preallocating a data frame and filling it in by index may be the best option.

EDIT 2: Switched the preallocated data frame to a more general number of rows.

df1 = as.data.frame(matrix(c(1,2,3,4,0,0,1,1), byrow = TRUE, nrow = 2))
# number of column pairs per row, and total rows in the output
nrow_combn = nrow(t(combn(df1[1,], m = 2)))
nrow_df = nrow(df1) * nrow_combn
# preallocate the output, then fill it one block of rows at a time
df2 = data.frame(V1 = rep(0, nrow_df), V2 = rep(0, nrow_df))
for(i in 1:nrow(df1)){
  df2[(((i-1)*nrow_combn)+1):(i*(nrow_combn)), ] = data.frame(t(combn(df1[i,], m = 2)))
}
Andy McKenzie
  • Thanks for your effort! I've tried your approach but the result is not the desired one: It produces 2 rows and six columns, simply duplicating each cell's element. I don't know why since the basic `t(combn(...` solution yields the right result, but just for one row at a time. – user5835099 Feb 29 '16 at 07:58
  • Growing an object will likely be slow no matter which tool the user uses. – Roman Luštrik Feb 29 '16 at 09:04
  • @RomanLuštrik did you read the link I posted? In that simulation, this approach was faster than pre-allocating a data frame. – Andy McKenzie Feb 29 '16 at 15:13
  • Thanks for your edit, @AndyMcKenzie! It works for this example, but it is not possible to scale b/c of `nrow_df = ncol(df1) * nrow(df1)`. For instance, having four columns in the initial sample would yield 4 choose 2 = 6 combinations per row, making it 12 in total, while the code just creates 8. – user5835099 Feb 29 '16 at 23:55
  • EDIT: A fix might be `nrow_df5 <- choose(ncol(df1),2) * nrow(df1)`. I'll try this now. – user5835099 Mar 01 '16 at 00:01
  • @user5835099 Thanks for pointing that out; hopefully the current answer is more general. – Andy McKenzie Mar 01 '16 at 00:03
  • Not sure what you tried with "data.table", but it's certainly doable. – A5C1D2H2I1M1N2O1R2T1 Mar 18 '16 at 13:22