Assuming that @r2evans's sample data is somewhat representative of your problem, here are a few other options to consider.
Option 1: Base R's reshape
Oddly enough, the function named precisely after the task you describe, reshape(), seems to be the "black sheep" among reshaping tools. If it's part of your regular toolkit, though, it's not too difficult to get the hang of.
For this problem, an approach might be:
reshape(dat, direction = "long", idvar = 1:3, varying = 4:ncol(dat), sep = ".")
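As a minimal, self-contained illustration (using a made-up two-row toy data set rather than the question's actual data), reshape() turns paired stub.time columns into long format like this:

```r
# Hypothetical toy data: two ids, two "time" points per stub
toy <- data.frame(id = 1:2,
                  Clarity.1 = c(3, 2), Clarity.2 = c(10, 3),
                  Effort.1  = c(5, 8), Effort.2  = c(4, 1))

# sep = "." tells reshape() to split names like "Clarity.1"
# into the stub ("Clarity") and the time value (1)
long <- reshape(toy, direction = "long", idvar = "id",
                varying = 2:ncol(toy), sep = ".")

long  # one row per id/time combination, with "Clarity" and "Effort" columns
```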
My concern with reshape() is not its syntax, but that (1) it can't handle unbalanced wide-to-long conversions (for example, if you had 3 "Clarity" columns but only 2 "Effort" columns), and (2) it can be painfully slow once you start dealing with many rows or many columns to be reshaped. That is why I wrote merged.stack.
Option 2: merged.stack
I wrote merged.stack as part of my "splitstackshape" package to handle reshaping tasks similar to what reshape(., direction = "long", ...) does (which is different, for instance, from what melt from "reshape"/"reshape2", and subsequently gather from "tidyr", does). I also wanted to simplify selecting the variables of interest by identifying just the variable "stubs" (in this case, "Clarity" and "Effort"). As mentioned above, merged.stack was also designed to be fast.
library(splitstackshape)
merged.stack(dat, var.stubs = c("Clarity", "Effort"), sep = ".")
# sheetNum year roleInDebate .time_1 Clarity Effort
# 1: 1 2006 x 1 3 5
# 2: 1 2006 x 2 10 4
# 3: 1 2006 x 3 5 7
# 4: 2 2009 y 1 2 8
# 5: 2 2009 y 2 3 1
# 6: 2 2009 y 3 6 8
# 7: 3 2013 r 1 7 10
# 8: 3 2013 r 2 7 4
# 9: 3 2013 r 3 5 2
# 10: 4 2020 q 1 4 4
# 11: 4 2020 q 2 2 9
# 12: 4 2020 q 3 2 8
# 13: 5 2004 b 1 8 8
# 14: 5 2004 b 2 3 4
# 15: 5 2004 b 3 9 5
Option 3: Wait for melt from "data.table" version 1.9.8
OK. Well, this may not really be an option (but the "data.table" developers work fast, so who knows), but in version 1.9.8 of "data.table" you would be able to melt specified lists of columns. See this issue for more details. Or, if you're more adventurous, install the 1.9.8 branch and try it out right away :-)
Eventually, this might make merged.stack redundant, since it will have similar functionality but even better speed (from what I can see in a few trials I've done).
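As a sketch of what that multi-column melt looks like (assuming the patterns() helper and list-valued measure.vars described in the feature discussion; the toy data below is made up, and exact syntax may differ between development versions):

```r
library(data.table)

# Hypothetical two-row subset of the wide data
DT <- data.table(sheetNum = 1:2, year = c(2006, 2009),
                 Clarity.1 = c(3, 2), Clarity.2 = c(10, 3),
                 Effort.1  = c(5, 8), Effort.2  = c(4, 1))

# Melt the "Clarity" and "Effort" column groups in a single call,
# pairing Clarity.i with Effort.i in the long result
melt(DT, measure.vars = patterns("^Clarity", "^Effort"),
     value.name = c("Clarity", "Effort"))
```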
Update -- benchmarks
I've tested just merged.stack and @r2evans's approach below. I didn't test reshape() because I was afraid it would slow my system to a crawl, and I didn't test melt from "data.table" because it would be better to wait for a production release.
Here's some sample data:
set.seed(1)
n <- 100000
r <- 6
dat <- data.frame(
  sheetNum = 1:n,
  year = sample(2000:2025, size = n, replace = TRUE),
  roleInDebate = sample(letters, size = n, replace = TRUE),
  matrix(sample(10, n * r * 2, replace = TRUE), nrow = n,
         dimnames = list(NULL, paste(c("Clarity", "Effort"),
                                     rep(seq_len(r), each = 2),
                                     sep = "."))))
Here are the two functions tested:
library(dplyr)
library(tidyr)

r2evans <- function() {
  dat %>%
    gather(var, val, -sheetNum, -year, -roleInDebate) %>%
    separate(var, c("skill", "person"), "\\.") %>%
    spread(skill, val)
}
ananda <- function() {
  merged.stack(dat, var.stubs = c("Clarity", "Effort"), sep = ".")
}
Here are the results on 10 runs:
library(microbenchmark)
microbenchmark(r2evans(), ananda(), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# r2evans() 3514.0961 3603.7102 3839.6097 3713.6705 3959.5320 4380.4601 10
# ananda() 320.5602 336.2396 363.7165 367.3344 386.3064 417.7994 10
And some verification that the output is the same:
out1 <- r2evans()
out2 <- ananda()
library(compare)
compare(out1, out2, allowAll = TRUE)
# TRUE
# renamed
# dropped names
# dropped attributes