R data.table reshape chunks of columns at once

Question

Lets say I have a data.table with these columns

nodeID   
hour1aaa   
hour1bbb   
hour1ccc   
hour2aaa   
hour2bbb   
hour2ccc   
...   
hour24aaa   
hour24bbb   
hour24ccc

for a total of 72 columns. Let's call it rawtable

I want to reshape it so I have

nodeID
hour
aaa
bbb
ccc

for a total of just these 5 columns where the hour column will contain whichever hour from the original 72 that it should be. Let's call it newshape

The way I'm doing it now is to use rbindlist with 24 items where each item is the proper subset of the bigger data.table. Like this (except I'm leaving out most of the hours in my example)

newshape<-rbindlist(list(
 rawtable[,list(nodeID, Hour=1, aaa=hour1aaa, bbb=hour1bbb, ccc=hour1ccc)], 
 rawtable[,list(nodeID, Hour=2, aaa=hour2aaa, bbb=hour2bbb, ccc=hour2ccc)], 
 rawtable[,list(nodeID, Hour=24, aaa=hour24aaa, bbb=hour24bbb, ccc=hour24ccc)]))

Here is some sample data to play with

rawtable<-data.table(nodeID=c(1,2),hour1aaa=c(12.4,32),hour1bbb=c(61.1,65.33),hour1ccc=c(-4.2,54),hour2aaa=c(12.2,1.2),hour2bbb=c(12.2,5.7),hour2ccc=c(5.6,101.9),hour24aaa=c(45.2,8.5),hour24bbb=c(23,7.9),hour24ccc=c(98,32.3))

Using my rbindlist approach gives the desired result but, as with most things I do with R, there is probably a better way. By better I mean more memory efficient, faster, and/or uses less lines of code. Does anyone have a better way to achieve this?

thelatemail · Answer 1 · 2017-06-21T00:57:20.857

3

This is a classic reshape problem if you get your names in the standard convention it expects, though I'm not sure this really harnesses the efficiency of the data.table structure:

reshape(
  setNames(rawtable, gsub("(\\D+)(\\d+)(\\D+)", "\\3.\\2", names(rawtable))),
  idvar="nodeID", direction="long", varying=-1
)

Result:

   nodeID hour  aaa   bbb   ccc
1:      1    1 12.4 61.10  -4.2
2:      2    1 32.0 65.33  54.0
3:      1    2 12.2 12.20   5.6
4:      2    2  1.2  5.70 101.9
5:      1   24 45.2 23.00  98.0
6:      2   24  8.5  7.90  32.3

@Arun's answer over here: https://stackoverflow.com/a/15510828/496803 may also be useful if you can adapt it to your current data.

edited Jun 21 '17 at 00:57

answered Aug 19 '13 at 23:37

thelatemail

91,185
12
128
188

+1 I don't think that there would be much of a problem with base R reshape. It always bugs me that `reshape` doesn't go to the trouble to figure out the `varying = list(...)` part, especially when you have to specify `v.names` in those cases. – A5C1D2H2I1M1N2O1R2T1 Aug 20 '13 at 02:23
@AnandaMahto - a small price to pay I suppose. You could always do `keynames <- c("aaa","bbb","ccc")` and use `lapply(keynames,grep,names(rawtable))` in the `varying` and `keynames` in the `v.names` part. It is a bit of an inconvenience though. – thelatemail Aug 20 '13 at 02:36
I have defined a vectorized `grep` for such purposes in my functions: `vGrep <- Vectorize(grep, "pattern", SIMPLIFY = FALSE)`. – A5C1D2H2I1M1N2O1R2T1 Aug 20 '13 at 02:46
Thanks for the suggestions. I changed the 2 data points listed above to `runif(1e6,min=-10,max=10)` for the vars and `1:1e6` for `nodeID` then did system.time on reshape. It came back with 5.8 vs `rbindlist` approach of 0.19. This doesn't surprise me as `reshape` isn't very fast in my experience compared to `reshape2` and I'm not aware of an analog in `reshape2` for the `times/timevar` argument in `reshape`. – Dean MacGregor Aug 20 '13 at 15:02

A5C1D2H2I1M1N2O1R2T1 · Answer 2 · 2013-09-09T16:43:25.893

One option is to use merged.stack from my package "splitstackshape". This function, stacks groups of columns and then merges the output together. Because of how the function creates the "time" variable, you can specify whatever you wanted to strip out from the column names. In this case, we want to strip out "hour", "aaa", "bbb", and "ccc" and have just the numbers remaining.

library(splitstackshape)
## Make sure you're using at least 1.2.0
packageVersion("splitstackshape")
# [1] ‘1.2.0’
merged.stack(rawtable, id.vars="nodeID", 
             var.stubs=c("aaa", "bbb", "ccc"), 
             sep="hour|aaa|bbb|ccc")
#    nodeID .time_1  aaa   bbb   ccc
# 1:      1       1 12.4 61.10  -4.2
# 2:      1       2 12.2 12.20   5.6
# 3:      1      24 45.2 23.00  98.0
# 4:      2       1 32.0 65.33  54.0
# 5:      2       2  1.2  5.70 101.9
# 6:      2      24  8.5  7.90  32.3

R data.table reshape chunks of columns at once

2 Answers2