4

Update: I should have been clearer that I was trying to check out the Enhanced functionality in reshaping using data.tables https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html. Updated the title.

I have this data set with two sets of variables - Credit_Risk_Capital and Name_concentration. They are calculated per 2 methodologies - New and Old. When I melt them using the data.table package, the variable names default to 1 and 2. How can I change them to just Credit_Risk_Capital and Name_Concentration.

Here is the data set

    df <-data.table (id = c(1:100),Credit_risk_Capital_old= rnorm(100, mean = 400, sd = 60),
             NameConcentration_old= rnorm(100, mean = 100, sd = 10),
             Credit_risk_Capital_New =rnorm(100, mean = 200, sd = 10),
             NameConcentration_New = rnorm(100, mean = 40, sd = 10))
    old <- c('Credit_risk_Capital_old','NameConcentration_old')
   new<-c('Credit_risk_Capital_New','NameConcentration_New')
  t1<-melt(df, measure.vars = list(old,new), variable.name = "CapitalChargeType",value.name = c("old","new"))

Now instead of the elements in the CapitalChargeType Column getting tagged as 1's and 2's, I want them to be changed to Credit_risk_Capital and NameConcentration. I can obviously change them in a subsequent step using a 'match' function, but is there anyway I can do it within melt itself.

ashleych
  • 1,042
  • 8
  • 25
  • 1
    This is an [open issue on GH](https://github.com/Rdatatable/data.table/issues/1547). See [here](https://stackoverflow.com/questions/41883573/convert-numeric-representation-of-variable-column-to-original-string-following/41884029#41884029) a temporary solution. Finally, no need in prespecifying `old` and `new` vars, just use `patterns` instead, e.g. `melt(df, measure.vars = patterns("old", "New"), variable.name = "CapitalChargeType", value.name = c("old", "new"))` – David Arenburg Jun 01 '17 at 12:09

3 Answers3

14

The issue here is that melt() doesn't know how to name the variables in case of more than one measure variable. So, it resorts to simply numbering the variables.

David has pointed out that there is a feature request. However, I will show two workarounds and compare them (plus the tidyr answer) in terms of speed.

  1. The first approach is to melt() all measure variables (which keeps the variable names), create new variable names, and dcast() the temporary result again to end up with two value columns. This recast approach is also being used by austensen.
  2. The second approach is what the OP is asking for (melting two value columns simultaneously) but includes a simple way to rename the variables afterwards.

Recast

library(data.table)   # CRAN version 1.10.4 used
# melt all measure variables
long <- melt(df, id.vars = "id")
# split variables names
long[, c("CapitalChargeType", "age") := 
       tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)] 
dcast(long, id + CapitalChargeType ~ age)
      id   CapitalChargeType       New       old
  1:   1 Credit_risk_Capital 204.85227 327.57606
  2:   1   NameConcentration  34.20043 104.14524
  3:   2 Credit_risk_Capital 206.96769 416.64575
  4:   2   NameConcentration  30.46721  95.25282
  5:   3 Credit_risk_Capital 201.85514 465.06647
 ---                                            
196:  98   NameConcentration  45.38833  90.34097
197:  99 Credit_risk_Capital 203.53625 458.37501
198:  99   NameConcentration  40.14643 101.62655
199: 100 Credit_risk_Capital 203.19156 527.26703
200: 100   NameConcentration  30.83511  79.21762

Note that the variable names are split at the last _ before the final old or New, resp. This is achieved by using a regular expression with positive look-ahead: "_(?=(New|old)$)"

Melting two columns and renaming variables

Here, we pick-up David's suggestion to use the patterns() function which is equivalent to specify a list of measure variables.

As a side note: The order of the list (or the patterns) determines the order of the value columns:

melt(df, measure.vars = patterns("New$", "old$"))
      id variable    value1    value2
  1:   1        1 204.85227 327.57606
  2:   2        1 206.96769 416.64575
  3:   3        1 201.85514 465.06647
  ...
melt(df, measure.vars = patterns("old$", "New$"))
      id variable    value1    value2
  1:   1        1 327.57606 204.85227
  2:   2        1 416.64575 206.96769
  3:   3        1 465.06647 201.85514
  ...

As already pointed out by the OP, melting with multiple measure variables

long <- melt(df, measure.vars = patterns("old$", "New$"), 
     variable.name = "CapitalChargeType",
     value.name = c("old", "New")) 

returns numbers instead of the variable names:

str(long)
Classes ‘data.table’ and 'data.frame':    200 obs. of  4 variables:
 $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CapitalChargeType: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ old              : num  328 417 465 259 426 ...
 $ New              : num  205 207 202 207 203 ...
 - attr(*, ".internal.selfref")=<externalptr>

Fortunately, these are factors which can be changed easily by replacing the factor levels with help of the forcats package:

long[, CapitalChargeType := forcats::lvls_revalue(
  CapitalChargeType, 
  c("Credit_risk_Capital", "NameConcentration"))]
long[order(id)]
      id   CapitalChargeType       old       New
  1:   1 Credit_risk_Capital 327.57606 204.85227
  2:   1   NameConcentration 104.14524  34.20043
  3:   2 Credit_risk_Capital 416.64575 206.96769
  4:   2   NameConcentration  95.25282  30.46721
  5:   3 Credit_risk_Capital 465.06647 201.85514
 ---                                            
196:  98   NameConcentration  90.34097  45.38833
197:  99 Credit_risk_Capital 458.37501 203.53625
198:  99   NameConcentration 101.62655  40.14643
199: 100 Credit_risk_Capital 527.26703 203.19156
200: 100   NameConcentration  79.21762  30.83511

Note that melt() numbers the variables in the order the columns appear in df.

reshape()

The stats package of base R has a reshape() function. Unfortunately, it doesn't accept regular expressions with positive look-ahead. So, the automatic guessing of variable names can't be used. Instead, all relevant parameters have to be specified explicitely:

old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
reshape(df, varying = list(old, new), direction = "long", 
        timevar = "CapitalChargeType",
        times = c("Credit_risk_Capital", "NameConcentration"),
        v.names = c("old", "New"))
      id   CapitalChargeType       old       New
  1:   1 Credit_risk_Capital 367.95567 194.93598
  2:   2 Credit_risk_Capital 467.98061 215.39663
  3:   3 Credit_risk_Capital 363.75586 201.72794
  4:   4 Credit_risk_Capital 433.45070 191.64176
  5:   5 Credit_risk_Capital 408.55776 193.44071
 ---                                            
196:  96   NameConcentration  93.67931  47.85263
197:  97   NameConcentration 101.32361  46.94047
198:  98   NameConcentration 104.80926  33.67270
199:  99   NameConcentration 101.33178  32.28041
200: 100   NameConcentration  85.37136  63.57817

Benchmarking

The benchmark includes all 4 approaches discussed so far:

  • tidyr with modifications to use resgular expressions with positive look-ahead,
  • recast,
  • melt() of multiple value variables, and
  • reshape().

The benchmark data consist of 100 K rows:

n_rows <- 100L
set.seed(1234L)
df <- data.table(
  id = c(1:n_rows),
  Credit_risk_Capital_old = rnorm(n_rows, mean = 400, sd = 60),
  NameConcentration_old = rnorm(n_rows, mean = 100, sd = 10),
  Credit_risk_Capital_New = rnorm(n_rows, mean = 200, sd = 10),
  NameConcentration_New = rnorm(n_rows, mean = 40, sd = 10))

For benchmarking, the microbenchmark package is used:

library(magrittr)
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
microbenchmark::microbenchmark(
  tidyr = {
    r_tidyr <- df %>% 
      dplyr::as_data_frame() %>%  
      tidyr::gather("key", "value", -id) %>% 
      tidyr::separate(key, c("CapitalChargeType", "age"), sep = "_(?=(New|old)$)") %>% 
      tidyr::spread(age, value)
  },
  recast = {
    r_recast <- dcast(
      melt(df, id.vars = "id")[
        , c("CapitalChargeType", "age") := 
          tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)], 
      id + CapitalChargeType ~ age)
  },
  m2col = {
    r_m2col <- melt(df, measure.vars = patterns("New$", "old$"), 
                    variable.name = "CapitalChargeType",
                    value.name = c("New", "old"))[
                      , CapitalChargeType := forcats::lvls_revalue(
                        CapitalChargeType, 
                        c("Credit_risk_Capital", "NameConcentration"))][order(id)]
  },
  reshape = {
    r_reshape <- reshape(df, varying = list(new, old), direction = "long", 
                         timevar = "CapitalChargeType",
                         times = c("Credit_risk_Capital", "NameConcentration"),
                         v.names = c("New", "old")
    )
  },
  times = 10L
)
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval
   tidyr 705.20364 789.63010 832.11391 813.08830 825.15259 1091.3188    10
  recast 215.35813 223.60715 287.28034 261.23333 338.36813  477.3355    10
   m2col  10.28721  11.35237  38.72393  14.46307  23.64113  154.3357    10
 reshape 143.75546 171.68592 379.05752 224.13671 269.95301 1730.5892    10

The timings show that melt() of two columns simultaneously is about 15 times faster than the second fastest, reshape(). Both recast variants fall behind because they both require two reshaping operations. The tidyr solution is particularly slow.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • 1
    You say that "`melt()` numbers the variables in the order the columns appear in `df`". Why can you claim that? I cannot find anything in the docs. – Ott Toomet Nov 15 '19 at 10:48
  • Hi @Uwe! I just wanted to let you know that I have updated my answer to [Convert numeric representation of 'variable' column to original string following melt using patterns](https://stackoverflow.com/questions/41883573/convert-numeric-representation-of-variable-column-to-original-string-following/41884029#41884029) (which was linked to above) with the new `measure` function, available in `data.table 1.14.1`. The function allows us to avoid the conversion of variable names to integer. See also the [corresponding PR](https://github.com/Rdatatable/data.table/pull/4731) for more details. – Henrik May 16 '21 at 15:26
  • Well, this was just a heads-up if you haven't already seen the function, and if you wish to incorporate the `measure` functionality in your nice answer. Cheers – Henrik May 16 '21 at 15:31
1

I'm not sure about using melt, but here's a way using tidyr

Note that I changed the variable name to use a . instead of _ to separate the name for the old/new. This makes it easier to separate the name into two variables since there are already many underscores.

library(tidyr)

df <- dplyr::data_frame(
  id = c(1:100),
  Credit_risk_Capital.old= rnorm(100, mean = 400, sd = 60),
  NameConcentration.old= rnorm(100, mean = 100, sd = 10),
  Credit_risk_Capital.new =rnorm(100, mean = 200, sd = 10),
  NameConcentration.new = rnorm(100, mean = 40, sd = 10)
)

df %>% 
  gather("key", "value", -id) %>% 
  separate(key, c("CapitalChargeType", "new_old"), sep = "\\.") %>% 
  spread(new_old, value)

#> # A tibble: 200 x 4
#>       id   CapitalChargeType       new       old
#> *  <int>               <chr>     <dbl>     <dbl>
#> 1      1 Credit_risk_Capital 182.10955 405.78530
#> 2      1   NameConcentration  42.21037  99.44172
#> 3      2 Credit_risk_Capital 184.28810 370.14308
#> 4      2   NameConcentration  60.92340 120.13933
#> 5      3 Credit_risk_Capital 191.07982 389.50818
#> 6      3   NameConcentration  25.81776  90.91502
#> 7      4 Credit_risk_Capital 193.64247 327.56853
#> 8      4   NameConcentration  32.71050  94.95743
#> 9      5 Credit_risk_Capital 208.63547 286.59351
#> 10     5   NameConcentration  40.76064 116.52747
#> # ... with 190 more rows
austensen
  • 2,857
  • 13
  • 24
  • the Tidyr version works, but I was testing out the data.tables functionality. I should have been clearer with my question. – ashleych Jun 01 '17 at 04:29
1

While the question is very old, a newer answer might help those directed to this question via a search. In data.table's most recent development version, there is a new measure function for melt, from which you can do:

df <-data.table(
  id = c(1:100),
  Credit_risk_Capital_old= rnorm(100, mean = 400, sd = 60),
  NameConcentration_old= rnorm(100, mean = 100, sd = 10),
  Credit_risk_Capital_New =rnorm(100, mean = 200, sd = 10),
  NameConcentration_New = rnorm(100, mean = 40, sd = 10)
)

melt(df,
     id.vars = "id",
     measure(CapitalChargeType, value.name,
             pattern = "(.*)_(New|old)"))

To get the output:

        id   CapitalChargeType       old       New
     <int>              <char>     <num>     <num>
  1:     1 Credit_risk_Capital 409.89004 210.30058
  2:     2 Credit_risk_Capital 403.15172 197.26172
  3:     3 Credit_risk_Capital 374.90492 192.21152
  4:     4 Credit_risk_Capital 509.17491 195.39095
  5:     5 Credit_risk_Capital 429.48302 197.44441
 ---                                              
196:    96   NameConcentration  80.64747  37.61926
197:    97   NameConcentration 104.39483  13.86576
198:    98   NameConcentration 106.87475  23.15775
199:    99   NameConcentration 112.92373  44.51562
200:   100   NameConcentration 111.80915  38.40075

The new version should come on CRAN in some time, but until then, you can use the development version. I'll try to update this answer when the version moves to CRAN.