The issue here is that melt()
doesn't know how to name the variables in case of more than one measure variable. So, it resorts to simply numbering the variables.
David has pointed out that there is a feature request. However, I will show two workarounds and compare them (plus the tidyr
answer) in terms of speed.
- The first approach is to
melt()
all measure variables (which keeps the variable names), create new variable names, and dcast()
the temporary result again to end up with two value columns. This recast approach is also being used by austensen.
- The second approach is what the OP is asking for (melting two value columns simultaneously) but includes a simple way to rename the variables afterwards.
Recast
library(data.table) # CRAN version 1.10.4 used
# melt all measure variables
long <- melt(df, id.vars = "id")
# split variables names
long[, c("CapitalChargeType", "age") :=
tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)]
dcast(long, id + CapitalChargeType ~ age)
id CapitalChargeType New old
1: 1 Credit_risk_Capital 204.85227 327.57606
2: 1 NameConcentration 34.20043 104.14524
3: 2 Credit_risk_Capital 206.96769 416.64575
4: 2 NameConcentration 30.46721 95.25282
5: 3 Credit_risk_Capital 201.85514 465.06647
---
196: 98 NameConcentration 45.38833 90.34097
197: 99 Credit_risk_Capital 203.53625 458.37501
198: 99 NameConcentration 40.14643 101.62655
199: 100 Credit_risk_Capital 203.19156 527.26703
200: 100 NameConcentration 30.83511 79.21762
Note that the variable names are split at the last _
before the final old
or New
, resp. This is achieved by using a regular expression with positive look-ahead: "_(?=(New|old)$)"
Melting two columns and renaming variables
Here, we pick-up David's suggestion to use the patterns()
function which is equivalent to specify a list of measure variables.
As a side note: The order of the list (or the patterns) determines the order of the value columns:
melt(df, measure.vars = patterns("New$", "old$"))
id variable value1 value2
1: 1 1 204.85227 327.57606
2: 2 1 206.96769 416.64575
3: 3 1 201.85514 465.06647
...
melt(df, measure.vars = patterns("old$", "New$"))
id variable value1 value2
1: 1 1 327.57606 204.85227
2: 2 1 416.64575 206.96769
3: 3 1 465.06647 201.85514
...
As already pointed out by the OP, melting with multiple measure variables
long <- melt(df, measure.vars = patterns("old$", "New$"),
variable.name = "CapitalChargeType",
value.name = c("old", "New"))
returns numbers instead of the variable names:
str(long)
Classes ‘data.table’ and 'data.frame': 200 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ CapitalChargeType: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ old : num 328 417 465 259 426 ...
$ New : num 205 207 202 207 203 ...
- attr(*, ".internal.selfref")=<externalptr>
Fortunately, these are factors which can be changed easily by replacing the factor levels with help of the forcats
package:
long[, CapitalChargeType := forcats::lvls_revalue(
CapitalChargeType,
c("Credit_risk_Capital", "NameConcentration"))]
long[order(id)]
id CapitalChargeType old New
1: 1 Credit_risk_Capital 327.57606 204.85227
2: 1 NameConcentration 104.14524 34.20043
3: 2 Credit_risk_Capital 416.64575 206.96769
4: 2 NameConcentration 95.25282 30.46721
5: 3 Credit_risk_Capital 465.06647 201.85514
---
196: 98 NameConcentration 90.34097 45.38833
197: 99 Credit_risk_Capital 458.37501 203.53625
198: 99 NameConcentration 101.62655 40.14643
199: 100 Credit_risk_Capital 527.26703 203.19156
200: 100 NameConcentration 79.21762 30.83511
Note that melt()
numbers the variables in the order the columns appear in df
.
reshape()
The stats
package of base R has a reshape()
function. Unfortunately, it doesn't accept regular expressions with positive look-ahead. So, the automatic guessing of variable names can't be used. Instead, all relevant parameters have to be specified explicitely:
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
reshape(df, varying = list(old, new), direction = "long",
timevar = "CapitalChargeType",
times = c("Credit_risk_Capital", "NameConcentration"),
v.names = c("old", "New"))
id CapitalChargeType old New
1: 1 Credit_risk_Capital 367.95567 194.93598
2: 2 Credit_risk_Capital 467.98061 215.39663
3: 3 Credit_risk_Capital 363.75586 201.72794
4: 4 Credit_risk_Capital 433.45070 191.64176
5: 5 Credit_risk_Capital 408.55776 193.44071
---
196: 96 NameConcentration 93.67931 47.85263
197: 97 NameConcentration 101.32361 46.94047
198: 98 NameConcentration 104.80926 33.67270
199: 99 NameConcentration 101.33178 32.28041
200: 100 NameConcentration 85.37136 63.57817
Benchmarking
The benchmark includes all 4 approaches discussed so far:
tidyr
with modifications to use resgular expressions with positive look-ahead,
recast
,
melt()
of multiple value variables, and
reshape()
.
The benchmark data consist of 100 K rows:
n_rows <- 100L
set.seed(1234L)
df <- data.table(
id = c(1:n_rows),
Credit_risk_Capital_old = rnorm(n_rows, mean = 400, sd = 60),
NameConcentration_old = rnorm(n_rows, mean = 100, sd = 10),
Credit_risk_Capital_New = rnorm(n_rows, mean = 200, sd = 10),
NameConcentration_New = rnorm(n_rows, mean = 40, sd = 10))
For benchmarking, the microbenchmark
package is used:
library(magrittr)
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
microbenchmark::microbenchmark(
tidyr = {
r_tidyr <- df %>%
dplyr::as_data_frame() %>%
tidyr::gather("key", "value", -id) %>%
tidyr::separate(key, c("CapitalChargeType", "age"), sep = "_(?=(New|old)$)") %>%
tidyr::spread(age, value)
},
recast = {
r_recast <- dcast(
melt(df, id.vars = "id")[
, c("CapitalChargeType", "age") :=
tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)],
id + CapitalChargeType ~ age)
},
m2col = {
r_m2col <- melt(df, measure.vars = patterns("New$", "old$"),
variable.name = "CapitalChargeType",
value.name = c("New", "old"))[
, CapitalChargeType := forcats::lvls_revalue(
CapitalChargeType,
c("Credit_risk_Capital", "NameConcentration"))][order(id)]
},
reshape = {
r_reshape <- reshape(df, varying = list(new, old), direction = "long",
timevar = "CapitalChargeType",
times = c("Credit_risk_Capital", "NameConcentration"),
v.names = c("New", "old")
)
},
times = 10L
)
Unit: milliseconds
expr min lq mean median uq max neval
tidyr 705.20364 789.63010 832.11391 813.08830 825.15259 1091.3188 10
recast 215.35813 223.60715 287.28034 261.23333 338.36813 477.3355 10
m2col 10.28721 11.35237 38.72393 14.46307 23.64113 154.3357 10
reshape 143.75546 171.68592 379.05752 224.13671 269.95301 1730.5892 10
The timings show that melt()
of two columns simultaneously is about 15 times faster than the second fastest, reshape()
. Both recast
variants fall behind because they both require two reshaping operations. The tidyr
solution is particularly slow.