data.table reshape with alternating columns

Question

I have a data frame with alternating columns that I want to reshape. The problem is that stats::reshape and reshape2::reshape are both very slow and memory-intensive on my actual use case. I suspect that the no-copy approach of data.table will save me time and use fewer resources, but I barely know where to start with the syntax (previous related efforts 1, 2).

Here's an example of how my data frame is structured:

set.seed(4)
dt <- data.frame(names = letters[1:10],
       one = rep(23,10), 
       two = sample(1000,10),
       three = sample(10,10),
       onea = rep(24,10), 
       twoa = sample(1000,10),
       threea = sample(10,10),
       oneb = rep(25,10), 
       twob = sample(1000,10),
       threeb = sample(10,10),
       onec = rep(26,10), 
       twoc = sample(1000,10),
       threec = sample(10,10), 
       oned = rep(26,10), 
       twod = sample(1000,10),
       threed = sample(10,10))

Which looks like this:

   names one two three onea twoa threea oneb twob threeb onec twoc threec oned
1      a  23 586     8   24  715      6   25  939      4   26  561      4   26
2      b  23   9     3   24  996      3   25  242      7   26   72      6   26
3      c  23 294     1   24  506      8   25  565      8   26  852     10   26
4      d  23 277     7   24  489      5   25  181      6   26  911      3   26
5      e  23 811     9   24  647      9   25  901      5   26  225      5   26
6      f  23 260     6   24  827      7   25   84      3   26  626      8   26
7      g  23 721     4   24  480      2   25  896      1   26   69      2   26
8      h  23 900     2   24  836      4   25  886     10   26  512      9   26
9      i  23 942     5   24  510      1   25  718      2   26  799      1   26
10     j  23  73    10   24  526     10   25  560      9   26  964      7   26
   twod threed
1   911      2
2   709     10
3   571      5
4   915      9
5   899      3
6    59      1
7    46      4
8   982      7
9   205      8
10  921      6

Here's what I'm currently doing with stats::reshape which takes a long time and uses a lot of memory on my actual use case:

df_l <- stats::reshape(dt, idvar='names',
                   varying=list(ones = colnames(dt[seq(from=2,
                                to=ncol(dt), by=3)]),
                                threes = colnames(dt[seq(from=4,
                                to=ncol(dt), by=3)])),
                   direction="long")

Here's the desired output (I don't care about any of the three columns):

 df_l
    names two twoa twob twoc twod time one three
a.1     a 586  715  939  561  911    1  23     8
b.1     b   9  996  242   72  709    1  23     3
c.1     c 294  506  565  852  571    1  23     1
d.1     d 277  489  181  911  915    1  23     7
e.1     e 811  647  901  225  899    1  23     9
f.1     f 260  827   84  626   59    1  23     6
g.1     g 721  480  896   69   46    1  23     4
h.1     h 900  836  886  512  982    1  23     2
i.1     i 942  510  718  799  205    1  23     5
j.1     j  73  526  560  964  921    1  23    10
a.2     a 586  715  939  561  911    2  24     6
b.2     b   9  996  242   72  709    2  24     3
c.2     c 294  506  565  852  571    2  24     8
d.2     d 277  489  181  911  915    2  24     5
e.2     e 811  647  901  225  899    2  24     9
f.2     f 260  827   84  626   59    2  24     7
g.2     g 721  480  896   69   46    2  24     2
h.2     h 900  836  886  512  982    2  24     4
i.2     i 942  510  718  799  205    2  24     1
j.2     j  73  526  560  964  921    2  24    10
a.3     a 586  715  939  561  911    3  25     4
b.3     b   9  996  242   72  709    3  25     7
c.3     c 294  506  565  852  571    3  25     8
d.3     d 277  489  181  911  915    3  25     6
e.3     e 811  647  901  225  899    3  25     5
f.3     f 260  827   84  626   59    3  25     3
g.3     g 721  480  896   69   46    3  25     1
h.3     h 900  836  886  512  982    3  25    10
i.3     i 942  510  718  799  205    3  25     2
j.3     j  73  526  560  964  921    3  25     9
a.4     a 586  715  939  561  911    4  26     4
b.4     b   9  996  242   72  709    4  26     6
c.4     c 294  506  565  852  571    4  26    10
d.4     d 277  489  181  911  915    4  26     3
e.4     e 811  647  901  225  899    4  26     5
f.4     f 260  827   84  626   59    4  26     8
g.4     g 721  480  896   69   46    4  26     2
h.4     h 900  836  886  512  982    4  26     9
i.4     i 942  510  718  799  205    4  26     1
j.4     j  73  526  560  964  921    4  26     7
a.5     a 586  715  939  561  911    5  26     2
b.5     b   9  996  242   72  709    5  26    10
c.5     c 294  506  565  852  571    5  26     5
d.5     d 277  489  181  911  915    5  26     9
e.5     e 811  647  901  225  899    5  26     3
f.5     f 260  827   84  626   59    5  26     1
g.5     g 721  480  896   69   46    5  26     4
h.5     h 900  836  886  512  982    5  26     7
i.5     i 942  510  718  799  205    5  26     8
j.5     j  73  526  560  964  921    5  26     6

How can I do this with data.table?

There's no such thing as `reshape2::reshape` - do you mean `reshape2::melt`? — hadley, Mar 20 '14 at 12:54
Yes, you're right, thanks. Is a `dreshape2` in the works (or `reshape3` or whatever)? I mean a `reshape2` with the amazing C++ speed of `dplyr`? — Ben, Mar 20 '14 at 17:47

CHP · Accepted Answer · 2014-03-20T11:01:46.813

For this very specific case, I think the following approach might be faster.

First, create a list of data.tables, each holding one set of the columns that need to be combined vertically (i.e. names plus one "one" column and one "three" column). That is accomplished by lapply(seq(3,15,3), function(j) { DT[, c(1, j-1, j+1), with=FALSE] })

Then use rbindlist, a faster version of rbind that stacks data.tables or data.frames by column position without checking the column names or order (which is what makes it fast).

Finally, join the resulting data.table DT2 back to the part of the original data.table DT that you want repeated, i.e. DT1, which holds only the columns names, two, twoa, twob, twoc and twod.

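library(data.table)

DT <- data.table(dt)

# columns to repeat for every stacked block
DT1 <- DT[, list(names, two, twoa, twob, twoc, twod)]

# stack each (names, one*, three*) triple; columns are bound by position
DT2 <- rbindlist(lapply(seq(3,15,3), function(j) { DT[, c(1, j-1, j+1), with=FALSE] }),
                 use.names=FALSE)

setkey(DT1, names)
setkey(DT2, names)

# keyed join: each DT2 row picks up the matching "two" columns from DT1
RES <- DT1[DT2]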

RES
##     names two twoa twob twoc twod one three
##  1:     a 586  715  939  561  911  23     8
##  2:     a 586  715  939  561  911  24     6
##  3:     a 586  715  939  561  911  25     4
##  4:     a 586  715  939  561  911  26     4
##  5:     a 586  715  939  561  911  26     2
##  6:     b   9  996  242   72  709  23     3
##  7:     b   9  996  242   72  709  24     3
##  8:     b   9  996  242   72  709  25     7
##  9:     b   9  996  242   72  709  26     6
## 10:     b   9  996  242   72  709  26    10
## 11:     c 294  506  565  852  571  23     1
## 12:     c 294  506  565  852  571  24     8
## 13:     c 294  506  565  852  571  25     8
## 14:     c 294  506  565  852  571  26    10
## 15:     c 294  506  565  852  571  26     5
## 16:     d 277  489  181  911  915  23     7
## 17:     d 277  489  181  911  915  24     5
## 18:     d 277  489  181  911  915  25     6
## 19:     d 277  489  181  911  915  26     3
## 20:     d 277  489  181  911  915  26     9
## 21:     e 811  647  901  225  899  23     9
## 22:     e 811  647  901  225  899  24     9
## 23:     e 811  647  901  225  899  25     5
## 24:     e 811  647  901  225  899  26     5
## 25:     e 811  647  901  225  899  26     3
## 26:     f 260  827   84  626   59  23     6
## 27:     f 260  827   84  626   59  24     7
## 28:     f 260  827   84  626   59  25     3
## 29:     f 260  827   84  626   59  26     8
## 30:     f 260  827   84  626   59  26     1
## 31:     g 721  480  896   69   46  23     4
## 32:     g 721  480  896   69   46  24     2
## 33:     g 721  480  896   69   46  25     1
## 34:     g 721  480  896   69   46  26     2
## 35:     g 721  480  896   69   46  26     4
## 36:     h 900  836  886  512  982  23     2
## 37:     h 900  836  886  512  982  24     4
## 38:     h 900  836  886  512  982  25    10
## 39:     h 900  836  886  512  982  26     9
## 40:     h 900  836  886  512  982  26     7
## 41:     i 942  510  718  799  205  23     5
## 42:     i 942  510  718  799  205  24     1
## 43:     i 942  510  718  799  205  25     2
## 44:     i 942  510  718  799  205  26     1
## 45:     i 942  510  718  799  205  26     8
## 46:     j  73  526  560  964  921  23    10
## 47:     j  73  526  560  964  921  24    10
## 48:     j  73  526  560  964  921  25     9
## 49:     j  73  526  560  964  921  26     7
## 50:     j  73  526  560  964  921  26     6
##     names two twoa twob twoc twod one three
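If the column positions in your real data are not as regular as in the example, the same idea can be sketched with name-based selection instead of hard-coded indices. This is only an illustration under some assumptions: the toy table below is made up (3 rows, 3 column groups), the column families are assumed to be recognisable by the prefixes one, two and three, and it relies on the on= and idcol= arguments found in reasonably recent data.table releases.

```r
library(data.table)

# Toy data with the same layout as the question (hypothetical values).
DT <- data.table(names = c("a", "b", "c"),
                 one  = 23L, two  = 1:3,   three  = 4:6,
                 onea = 24L, twoa = 11:13, threea = 14:16,
                 oneb = 25L, twob = 21:23, threeb = 24:26)

# Locate each column family by name instead of by position.
one_cols   <- grep("^one",   names(DT), value = TRUE)
two_cols   <- grep("^two",   names(DT), value = TRUE)
three_cols <- grep("^three", names(DT), value = TRUE)

# One small table per (one*, three*) pair, all renamed to the same columns.
pieces <- Map(function(o, th) {
  setnames(DT[, c("names", o, th), with = FALSE], c("names", "one", "three"))
}, one_cols, three_cols)

# Stack the pieces; idcol records which piece each row came from (1, 2, 3).
DT2 <- rbindlist(unname(pieces), use.names = FALSE, idcol = "time")

# Join the repeated "two" columns back on by name.
DT1 <- DT[, c("names", two_cols), with = FALSE]
RES <- DT1[DT2, on = "names"]
```

The time column here plays the same role as the one produced by stats::reshape in the question, so sorting with RES[order(time)] or RES[order(names)] is straightforward afterwards.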

Another way is to use the data.table:::melt.data.table function twice (once for the ones and once for the threes) and cbind the results together.

RES <- cbind(data.table:::melt.data.table(DT, id.vars=c(1,seq(3,15,3)),
       measure.vars = seq(2,14,3), value.name='one', variable.name="onevar")[,-7, with=FALSE],
       data.table:::melt.data.table(DT, id.vars=c(1), measure.vars = seq(4,16,3),
       value.name='three', variable.name="threevar")[, 3, with=FALSE])

# melt stacks the rows block by block (one, onea, ...); sort by names to group them per name
setkey(RES, names)

RES
##     names two twoa twob twoc twod one three
##  1:     a 586  715  939  561  911  23     8
##  2:     a 586  715  939  561  911  24     6
##  3:     a 586  715  939  561  911  25     4
##  4:     a 586  715  939  561  911  26     4
##  5:     a 586  715  939  561  911  26     2
##  6:     b   9  996  242   72  709  23     3
##  7:     b   9  996  242   72  709  24     3
##  8:     b   9  996  242   72  709  25     7
##  9:     b   9  996  242   72  709  26     6
## 10:     b   9  996  242   72  709  26    10
## 11:     c 294  506  565  852  571  23     1
## 12:     c 294  506  565  852  571  24     8
## 13:     c 294  506  565  852  571  25     8
## 14:     c 294  506  565  852  571  26    10
## 15:     c 294  506  565  852  571  26     5
## 16:     d 277  489  181  911  915  23     7
## 17:     d 277  489  181  911  915  24     5
## 18:     d 277  489  181  911  915  25     6
## 19:     d 277  489  181  911  915  26     3
## 20:     d 277  489  181  911  915  26     9
## 21:     e 811  647  901  225  899  23     9
## 22:     e 811  647  901  225  899  24     9
## 23:     e 811  647  901  225  899  25     5
## 24:     e 811  647  901  225  899  26     5
## 25:     e 811  647  901  225  899  26     3
## 26:     f 260  827   84  626   59  23     6
## 27:     f 260  827   84  626   59  24     7
## 28:     f 260  827   84  626   59  25     3
## 29:     f 260  827   84  626   59  26     8
## 30:     f 260  827   84  626   59  26     1
## 31:     g 721  480  896   69   46  23     4
## 32:     g 721  480  896   69   46  24     2
## 33:     g 721  480  896   69   46  25     1
## 34:     g 721  480  896   69   46  26     2
## 35:     g 721  480  896   69   46  26     4
## 36:     h 900  836  886  512  982  23     2
## 37:     h 900  836  886  512  982  24     4
## 38:     h 900  836  886  512  982  25    10
## 39:     h 900  836  886  512  982  26     9
## 40:     h 900  836  886  512  982  26     7
## 41:     i 942  510  718  799  205  23     5
## 42:     i 942  510  718  799  205  24     1
## 43:     i 942  510  718  799  205  25     2
## 44:     i 942  510  718  799  205  26     1
## 45:     i 942  510  718  799  205  26     8
## 46:     j  73  526  560  964  921  23    10
## 47:     j  73  526  560  964  921  24    10
## 48:     j  73  526  560  964  921  25     9
## 49:     j  73  526  560  964  921  26     7
## 50:     j  73  526  560  964  921  26     6
##     names two twoa twob twoc twod one three
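For what it's worth, the two melt calls can probably be collapsed into one in reasonably recent data.table releases, where melt accepts several measure groups through patterns(). The sketch below is not the code from this answer; the toy table is made up, and the regular expressions assume the column families really do start with one, two and three.

```r
library(data.table)

# Toy data with the same layout as the question (hypothetical values).
DT <- data.table(names = c("a", "b", "c"),
                 one  = 23L, two  = 1:3,   three  = 4:6,
                 onea = 24L, twoa = 11:13, threea = 14:16,
                 oneb = 25L, twob = 21:23, threeb = 24:26)

# names plus every "two" column stay as id columns and are repeated.
id_cols <- c("names", grep("^two", names(DT), value = TRUE))

# Each pattern defines one group of columns that is stacked into its own
# value column; value.name labels the groups in the same order.
long <- melt(DT,
             id.vars       = id_cols,
             measure.vars  = patterns("^one", "^three"),
             value.name    = c("one", "three"),
             variable.name = "time")

# The group index comes back as a factor ("1", "2", "3"); make it an integer.
long[, time := as.integer(time)]
```

The rows come out block by block (all names for the first group, then the second, and so on), so add setkey(long, names) if you want them grouped per name as in the output above.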

Excellent, thanks very much. Your first solution is easily adaptable to my actual use case (adding `RES[order(one)]` etc.) and has cut the time from 15.52 sec to 0.47 sec, a nice 15x increase in speed. — Ben, Mar 20 '14 at 18:11

score 2 · Answer 2 · edited Mar 21 '14 at 17:04

Since version 1.8.11, data.table provides its own fast melt method for the reshape2 generic, as well as dcast.data.table. Check the manual for more examples and details.

This problem can be solved by selecting based on substring match.

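require(reshape2)
require(data.table)
dt <- data.table(dt)
# keep names and the "two" columns as ids; everything else is melted
dt_melt <- melt(dt, id.vars = c(1,3,6,9,12,15))
# rows that came from the "one" columns, and values from the "three" columns
a <- dt_melt[like(variable,"one"), ]
b <- dt_melt[like(variable,"three"), value]
# attach the "three" values, number the blocks of 10 rows, drop the source label
c <- a[, three := b][, time := ceiling(.I/10)][, variable := NULL]
setnames(c, "value", "one")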

The logic here is quite simple: first melt, then select rows based on a substring match.

I don't know whether your real data follow a naming pattern such as one and three. If not, some modifications may be needed.

Here's the result, printed before the variable column is dropped:

    names two twoa twob twoc twod variable   one three time
 1:     a 586  715  939  561  911      one    23     8    1
 2:     b   9  996  242   72  709      one    23     3    1
 3:     c 294  506  565  852  571      one    23     1    1
 4:     d 277  489  181  911  915      one    23     7    1
 5:     e 811  647  901  225  899      one    23     9    1
 6:     f 260  827   84  626   59      one    23     6    1
 7:     g 721  480  896   69   46      one    23     4    1
 8:     h 900  836  886  512  982      one    23     2    1
 9:     i 942  510  718  799  205      one    23     5    1
10:     j  73  526  560  964  921      one    23    10    1
11:     a 586  715  939  561  911     onea    24     6    2
12:     b   9  996  242   72  709     onea    24     3    2
13:     c 294  506  565  852  571     onea    24     8    2
14:     d 277  489  181  911  915     onea    24     5    2
15:     e 811  647  901  225  899     onea    24     9    2
16:     f 260  827   84  626   59     onea    24     7    2
17:     g 721  480  896   69   46     onea    24     2    2
18:     h 900  836  886  512  982     onea    24     4    2
19:     i 942  510  718  799  205     onea    24     1    2
20:     j  73  526  560  964  921     onea    24    10    2
21:     a 586  715  939  561  911     oneb    25     4    3
22:     b   9  996  242   72  709     oneb    25     7    3
23:     c 294  506  565  852  571     oneb    25     8    3
24:     d 277  489  181  911  915     oneb    25     6    3
25:     e 811  647  901  225  899     oneb    25     5    3
26:     f 260  827   84  626   59     oneb    25     3    3
27:     g 721  480  896   69   46     oneb    25     1    3
28:     h 900  836  886  512  982     oneb    25    10    3
29:     i 942  510  718  799  205     oneb    25     2    3
30:     j  73  526  560  964  921     oneb    25     9    3
31:     a 586  715  939  561  911     onec    26     4    4
32:     b   9  996  242   72  709     onec    26     6    4
33:     c 294  506  565  852  571     onec    26    10    4
34:     d 277  489  181  911  915     onec    26     3    4
35:     e 811  647  901  225  899     onec    26     5    4
36:     f 260  827   84  626   59     onec    26     8    4
37:     g 721  480  896   69   46     onec    26     2    4
38:     h 900  836  886  512  982     onec    26     9    4
39:     i 942  510  718  799  205     onec    26     1    4
40:     j  73  526  560  964  921     onec    26     7    4
41:     a 586  715  939  561  911     oned    26     2    5
42:     b   9  996  242   72  709     oned    26    10    5
43:     c 294  506  565  852  571     oned    26     5    5
44:     d 277  489  181  911  915     oned    26     9    5
45:     e 811  647  901  225  899     oned    26     3    5
46:     f 260  827   84  626   59     oned    26     1    5
47:     g 721  480  896   69   46     oned    26     4    5
48:     h 900  836  886  512  982     oned    26     7    5
49:     i 942  510  718  799  205     oned    26     8    5
50:     j  73  526  560  964  921     oned    26     6    5
    names two twoa twob twoc twod variable   one three time
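Note that ceiling(.I/10) and the id indices c(1,3,6,9,12,15) are tied to this particular example (10 rows, 16 columns). A possible variation, sketched here on made-up toy data, derives the id columns from their names and numbers the blocks with rleid(), which is available in reasonably recent data.table releases. The ^one, ^two and ^three prefixes are an assumption about how the real columns are named.

```r
library(data.table)

# Toy data with the same layout as the question (hypothetical values).
DT <- data.table(names = c("a", "b", "c"),
                 one  = 23L, two  = 1:3,   three  = 4:6,
                 onea = 24L, twoa = 11:13, threea = 14:16,
                 oneb = 25L, twob = 21:23, threeb = 24:26)

# names plus every "two" column are the id columns; the rest is melted.
id_cols <- c("names", grep("^two", names(DT), value = TRUE))
dt_melt <- melt(DT, id.vars = id_cols)

# Split the long table by source column family.
ones   <- dt_melt[variable %like% "^one"]
threes <- dt_melt[variable %like% "^three"]

# rleid() gives 1 for the first source column, 2 for the next, and so on,
# regardless of how many rows each source column contributes.
res <- ones[, three := threes$value][, time := rleid(variable)][, variable := NULL]
setnames(res, "value", "one")
```

This relies on the "one" and "three" rows coming out of melt in the same order, which holds here because both families are melted from the same table in one call.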

Thanks for taking a look, I'm getting `Error in is.factor(vector) : object 'variable' not found` for your lines that creates `a`, `b` and `c`. Any thoughts? — Ben, Mar 20 '14 at 17:45
Oh, you need to convert your data.frame into a data.table first. — Bigchao, Mar 21 '14 at 02:37
