5

Is there any way to retain names of the original levels of variables that are being melted? For example, in the example below, is there any way to get "alpha", "beta", and "gamma" instead of "1", "2", "3".

I can, of course, rename them but the dataset with which I am working with has tons of levels and so renaming them will be time-consuming and error-prone.

Thanks.

library(data.table)
#> Warning: package 'data.table' was built under R version 3.4.2
set.seed(2334)

# define the dataframe
df <-
  as.data.frame(
    cbind(
      a_alpha = rnorm(10),
      a_beta = rnorm(10),
      a_gamma = rnorm(10),
      b_alpha = rnorm(10),
      b_beta = rnorm(10),
      b_gamma = rnorm(10),
      id = c(1:10)
    )
  )

# check the structure of the wide format data
str(df)
#> 'data.frame':    10 obs. of  7 variables:
#>  $ a_alpha: num  -0.118 1.237 0.809 -0.766 -0.592 ...
#>  $ a_beta : num  0.0019 1.0639 2.336 0.9056 0.6449 ...
#>  $ a_gamma: num  0.5485 0.8345 -0.5977 0.0827 0.2754 ...
#>  $ b_alpha: num  0.209 -0.305 0.434 -0.362 0.412 ...
#>  $ b_beta : num  -1.6404 2.8382 0.0661 0.7249 -0.4421 ...
#>  $ b_gamma: num  -0.144 0.964 -0.763 -1.356 0.995 ...
#>  $ id     : num  1 2 3 4 5 6 7 8 9 10

# convert to long format
df_long <- data.table::melt(
  data.table::setDT(df),
  measure = patterns("^a_", "^b_"),
  value.name = c("a", "b"),
  variable.name = "item"
)

# check the structure of the long format data
str(df_long)
#> Classes 'data.table' and 'data.frame':   30 obs. of  4 variables:
#>  $ id  : num  1 2 3 4 5 6 7 8 9 10 ...
#>  $ item: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ a   : num  -0.118 1.237 0.809 -0.766 -0.592 ...
#>  $ b   : num  0.209 -0.305 0.434 -0.362 0.412 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

# structure of item
levels(df_long$item)
#> [1] "1" "2" "3"

# Question: instead of "1" "2" "3", how to get the "item" factor levels to be: "alpha" "beta" "gamma"

Created on 2018-01-12 by the reprex package (v0.1.1.9000).

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
Indrajeet Patil
  • 4,673
  • 2
  • 20
  • 51
  • I'm not sure where you think `data.table` would get the `"alpha" "beta" "gamma"` strings from since they don't exist (in that form) in your original table. – Gregor Thomas Jan 13 '18 at 04:21
  • But the original column names that are being melted are: `a_alpha a_beta a_gamma b_alpha b_beta b_gamma` If I were to use- `stats::reshape(df, timevar = "item", varying = base::dput(as.character(as.vector(names( df[, grep("^a_|^b_")], direction = "long", idvar = "id", sep = "_")` The item levels **will be** "alpha" "beta" "gamma". The `data.table` instead uses "1" "2" "3", irrespective of what you give. I like the parsimonius `patterns` argument of `data.table` and that's why I want to use it. – Indrajeet Patil Jan 13 '18 at 04:40
  • 1
    Take a look at [this](https://gist.github.com/mrdwab/ceded54b616bdc61cb0f) for some inspiration. You'll basically have to use `factor` or `match` or something of the sort. – A5C1D2H2I1M1N2O1R2T1 Jan 13 '18 at 04:41
  • 1
    I think that's a consequence of how the two packages find the columns... `data.table`'s `grep` is too flexible to reliably pull out the part of the column name that is wanted (though it could work in your case). Whereas the `sep` version implicitly could be sent to `strsplit` to break the string apart as needed. With the parsimony flexibility is lost. – Gregor Thomas Jan 13 '18 at 04:43
  • Note that the approach I've shared would also depend on the order of columns, which is one of the reasons I've not finalized the function. – A5C1D2H2I1M1N2O1R2T1 Jan 13 '18 at 04:54
  • 1
    @Henrik Thank you so much!!! That totally worked for me :) – Indrajeet Patil Jan 13 '18 at 13:55
  • 1
    @IndrajeetPatil Please note that I have updated [my answer in the linked question](https://stackoverflow.com/a/41884029/1851712). From `data.table 1.14.1`, the new function `measure` keeps the original string of concatenated variable names. – Henrik May 16 '21 at 17:10

1 Answers1

1

The way I've dealt with this in the past is to use factor after melting the data. However, you'll have to probably do some checks to make sure the data and levels are in the correct order.

Here's an example:

set.seed(2334)
df <- data.table(a_alpha = rnorm(10), a_beta = rnorm(10), a_gamma = rnorm(10),
                 b_alpha = rnorm(10), b_beta = rnorm(10), b_gamma = rnorm(10), 
                 id = c(1:10))
df_mess <- copy(df)
setcolorder(df_mess, c(1, 7, 6, 4, 2, 5, 3))
names(df_mess)
# [1] "b_alpha" "id"      "a_alpha" "a_beta"  "b_beta"  "b_gamma" "a_gamma"

stubs <- c("^a_", "^b_")
## assumes all stubs have same number of cols. Easy to modify
labs <- grep(stubs[1], names(df_mess), value = TRUE) 
labs <- gsub(paste(stubs, collapse = "|"), "", labs[order(labs)])

out1 <- melt(df, measure.vars = patterns(stubs), value.name = c("a", "b"), 
             variable.name = "item")[
               , item := factor(item, labels = labs)][]

out2a <- melt(df_mess, measure.vars = patterns(stubs), value.name = c("a", "b"), 
              variable.name = "item")[
                , item := factor(item, labels = labs)][]

out2b <- melt(setcolorder(df_mess, names(df_mess)[order(names(df_mess))]),
     measure.vars = patterns(stubs), value.name = c("a", "b"), 
     variable.name = "item")[
       , item := factor(item, labels = labs)][]

library(compare)
compare(out1, out2a)
# FALSE [TRUE, TRUE, TRUE, FALSE]
compare(out1, out2b)
# TRUE

I haven't done enough test cases to say with confidence whether using order on the names and the levs is sufficient for all cases, but so far, I haven't only found exceptions when the data are not balanced.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485