
I want to convert a dataframe from wide format to long format.

Here is a toy example:

mydata <- data.frame(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3, 5), BB_2 = rep(6, 5), CC_7 = 6:2)

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1    1    5    3    6    6
2    2    4    3    6    5
3    3    3    3    6    4
4    4    2    3    6    3
5    5    1    3    6    2

There are some variables that will remain as they are (here only ID) and some that will be transformed to long format (here all the other variables, all ending with _1, _2 or _7).

To transform it to long format I'm using data.table's melt and dcast in a generic way that detects the variables automatically. Other solutions are welcome too.

library(data.table)
setDT(mydata)
idvars <- grep("_[1-7]$", names(mydata), invert = TRUE)
temp <- melt(mydata, id.vars = idvars)
nuevo <- dcast(
  temp[, `:=`(var = sub("_[1-7]$", '', variable),
              measure = sub('.*_', '', variable),
              variable = NULL)],
  ... ~ var, value.var = 'value')



ID measure BB  CC  ZA
 1      1   3  NA   1
 1      2   6  NA   5
 1      7  NA   6  NA
 2      1   3  NA   2
 2      2   6  NA   4
 2      7  NA   5  NA
 3      1   3  NA   3
 3      2   6  NA   3
 3      7  NA   4  NA
 4      1   3  NA   4
 4      2   6  NA   2
 4      7  NA   3  NA
 5      1   3  NA   5
 5      2   6  NA   1
 5      7  NA   2  NA

As you can see the columns are reordered alphabetically, but I would prefer to keep the original order as far as possible, for example by taking into account the order of first appearance of each variable.

ID ZA_1 ZA_2 BB_1 BB_2 CC_7

Should be

ID ZA BB CC

I don't mind whether the idvars columns come all together at the beginning or stay in their original positions.

ID ZA_1 ZA_2 TEMP BB_1 BB_2 CC_2 CC_1

would be

ID ZA TEMP BB CC

or

ID TEMP ZA BB CC

I prefer the last option.
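
Just to make explicit what I mean by "order of first appearance" (this is only an illustration, not part of my attempt below): the desired prefix order can be read off the column names, e.g.

# order of first appearance of the prefixes among the measure columns
measure_cols <- grep("_[0-9]$", names(mydata), value = TRUE)
unique(sub("_[0-9]$", "", measure_cols))
# [1] "ZA" "BB" "CC"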

Another problem is that everything gets transformed to character.
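
A possible workaround (just a sketch, assuming the nuevo result from my attempt above) would be to re-guess the types of the character columns after the dcast with type.convert():

# sketch: convert character columns of the dcast result back to their likely types
char_cols <- names(nuevo)[sapply(nuevo, is.character)]
nuevo[, (char_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = char_cols]
str(nuevo)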

skan
  • As far as I've seen, those are simple solutions where you specify the variable names manually. But I need an automated way to do it, with grep, because my whole dataset has 3500 variables. – skan Dec 15 '16 at 11:58
  • @Jaap the question is not the same because I focus on two problems: I don't want the output to be reordered, and I need a memory-efficient solution, not just the trivial solution as in the other question you link, and my pattern is more complex. Uwe's solution is the good one, and it was already accepted one year ago. Nobody at your link provides a good solution, so please remove your "duplicate question" tag from this question because it's not a duplicate. – skan Apr 10 '18 at 22:36
  • Thx for the explanation. I've reopened the question & also posted an alternative solution. – Jaap Apr 11 '18 at 05:56

5 Answers


You can melt several columns simultaneously if you pass a list of column names to the measure = argument. One approach to do this in a scalable manner would be to:

  1. Extract the names of the measure columns and their corresponding prefixes:

    measurevars <- names(mydata)[grepl("_[1-9]$",names(mydata))]
    groups <- gsub("_[1-9]$","",measurevars)
    
  2. Turn groups into a factor object and make sure levels aren't ordered alphabetically. We'll use this in the next step to create a list object with the correct structure.

    split_on <- factor(groups, levels = unique(groups))
    
  3. Create a list from measurevars with split(), and create a vector for the value.name = argument in melt().

    measure_list <- split(measurevars, split_on)
    measurenames <- unique(groups)
    

Bringing it all together:

melt(setDT(mydata), 
     measure = measure_list, 
     value.name = measurenames,
     variable.name = "measure")
#    ID measure ZA BB
# 1:  1       1  1  3
# 2:  2       1  2  3
# 3:  3       1  3  3
# 4:  4       1  4  3
# 5:  5       1  5  3
# 6:  1       2  5  6
# 7:  2       2  4  6
# 8:  3       2  3  6
# 9:  4       2  2  6
#10:  5       2  1  6
mtoto
  • My whole dataset has 3500 variables. This is just a toy example. That's why I'm using the more complex way to find the variables to melt and cast automatically, with grep. – skan Dec 15 '16 at 12:03
  • To put it simply, they all end with _x where x is a single digit. – skan Dec 15 '16 at 12:10
  • And how do you avoid everything being cast to character? – skan Dec 15 '16 at 12:46
  • I've tried your code and measurenames <- as.character(unique(levels)) produces an error: Error in unique.default(levels) : "unique() applies only to vectors". I guess you need to add split_on inside levels() – skan Dec 15 '16 at 13:14
  • For future use I will compare your method (only needs a melt) with mine (needs melt and dcast) and benchmark both in speed and memory usage. – skan Dec 15 '16 at 13:22
  • And I think it is better to use gsub("_[1-9]$","",measurevars) instead of groups <- gsub("[^A-Z]","", measurevars). It won't modify any number or symbol inside the main name. – skan Dec 15 '16 at 13:57
  • If I try with **mydata <- data.frame(ID=1:5, ZA_2001=1:5, ZA_2002=5:1,BB_2001=rep(3,5),BB_2002=rep(6,5), CC_2007=rep(6,5))** and using **"_20[0-9][0-9]$"** the measure column is populated with 1 and 2 instead of 2001, 2002 and 2007, why? – skan Jan 09 '17 at 20:53
  • or just mydata <- data.frame(ID=1:5, ZA_1=1:5, ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5), CC_7=rep(6,5)). The CC_7 values are misplaced. I think your method expects all the variable names to have the same number of variations. And it doesn't use the _xxxx information, only the position in the list. – skan Jan 09 '17 at 21:17

The OP has updated the answer to his own question, complaining about the memory consumption of the intermediate melt() step when half of the columns are id.vars. He requested a direct way in data.table to do it without creating giant intermediate steps.

Well, data.table already has that ability; it's called a join.

Given the sample data from the Q, the whole operation can be implemented in a less memory-consuming way by reshaping with only one id.var and later joining the reshaped result with the original data.table:

setDT(mydata)

# add unique row number to join on later 
# (leave `ID` col as placeholder for all other id.vars)
mydata[, rn := seq_len(.N)]

# define columns to be reshaped
measure_cols <- stringr::str_subset(names(mydata), "_\\d$")

# melt with only one id.vars column
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)

# split column names of measure.vars
# Note that "variable" is reused to save memory 
molten[, c("variable", "measure") := tstrsplit(variable, "_")]

# coerce names to factors in the same order as the columns appeared in mydata
molten[, variable := forcats::fct_inorder(variable)]

# remove columns no longer needed in mydata _before_ joining to save memory
mydata[, (measure_cols) := NULL]

# final dcast and right join
result <- mydata[dcast(molten, ... ~ variable), on = "rn"]
result
#    ID rn measure ZA BB CC
# 1:  1  1       1  1  3 NA
# 2:  1  1       2  5  6 NA
# 3:  1  1       7 NA NA  6
# 4:  2  2       1  2  3 NA
# 5:  2  2       2  4  6 NA
# 6:  2  2       7 NA NA  5
# 7:  3  3       1  3  3 NA
# 8:  3  3       2  3  6 NA
# 9:  3  3       7 NA NA  4
#10:  4  4       1  4  3 NA
#11:  4  4       2  2  6 NA
#12:  4  4       7 NA NA  3
#13:  5  5       1  5  3 NA
#14:  5  5       2  1  6 NA
#15:  5  5       7 NA NA  2

Finally, you may remove the row number, if it is no longer needed, with result[, rn := NULL].

Furthermore, you can remove the intermediate molten with rm(molten).

We started with a data.table consisting of 1 id column, 5 measure cols, and 5 rows. The reshaped result has 1 id column, 3 measure cols, and 15 rows. So, the data volume stored in id columns has effectively tripled. However, the intermediate step needed only one id.var, rn.

EDIT If memory consumption is crucial, it might be worthwhile to keep the id.vars and the measure.vars in two separate data.tables and to join only the necessary id.var columns with the measure.vars on demand.
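
A minimal sketch of that idea (object and column names are just for illustration, starting again from the original wide mydata):

setDT(mydata)
mydata[, rn := seq_len(.N)]
measure_cols <- grep("_\\d$", names(mydata), value = TRUE)
id_cols <- setdiff(names(mydata), c(measure_cols, "rn"))

# keep the id.vars in their own small table
ids <- mydata[, c("rn", id_cols), with = FALSE]

# reshape only rn plus the measure.vars
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)
molten[, c("variable", "measure") := tstrsplit(variable, "_")]
molten[, variable := factor(variable, levels = unique(variable))]
reshaped <- dcast(molten, rn + measure ~ variable)

# later, join in only the id columns which are actually needed
reshaped[ids[, .(rn, ID)], on = "rn"]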

Note that the measure.vars parameter to melt() allows for a special function patterns(). With this, the call to melt() could also have been written as

molten <- melt(mydata, id.vars = "rn", measure.vars = patterns("_\\d$"))
Uwe
  • I'll try it. In order not to use extra packages can I use grep("_\\d$",names(mydata2), value = T ) and molten[,var:=factor(variable, levels=unique(variable))] instead? I've also tried patterns() before but I had some problems with it. – skan Mar 21 '17 at 14:39
  • Yes, of course you can; that should work as well. However, I recommend using these two packages, which were initiated by Hadley Wickham in his efforts to improve the usability of R. – Uwe Mar 21 '17 at 15:21
  • Please watch out: it should read `molten[, variable := factor(variable, levels = unique(variable))]` instead of `molten[,var:=factor(variable, levels=unique(variable))]` in order to save memory. – Uwe Mar 21 '17 at 15:26
  • I think the "remove columns no longer needed ..." should be located after the cast, or maybe remove them as rows instead, isn't it? – skan Mar 21 '17 at 16:24
  • No, these are the old measure cols `ZA_1`, `ZA_2`, etc, in `mydata` whose contents are now part of `molten`. You may skip this statement and keep those columns in `mydata` for quality assurance, but they are essentially redundant and will only contribute to the memory footprint of `result`. This is why I suggest removing them _before_ the join (`dcast()` only uses `molten`; I've put the join and `dcast()` in one statement to avoid creating another temporary variable). To scrutinize the code you may intersperse calls to `str(molten)`. – Uwe Mar 21 '17 at 16:51
  • Note that the old measure cols are removed from `mydata` not from `molten`. – Uwe Mar 21 '17 at 16:54
  • It works very well. Though I've tried a much larger dataset and found that R fills a lot of memory and doesn't recover most of it, even using gc(). I have to break the data into chunks first. I think R is not the proper tool to deal with big data. – skan Mar 21 '17 at 23:09
  • "big" data is relative. By default, R is working in memory. So, you need a sufficiently sized hardware, anyway. However, the section on _Large memory and out-of-memory data_ in the [CRAN HPC task view](https://cran.r-project.org/web/views/HighPerformanceComputing.html) lists 11 packages which help to deal with this kind of situations. – Uwe Mar 21 '17 at 23:31
  • Today I've tried again and mydata[, rn := seq_len(.N)] produces this warning: Warning message: In `[.data.table`(my, , `:=`(rn, seq_len(.N))) : Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). – skan Apr 25 '17 at 15:10
  • @skan It's only a warning about a potential inefficiency. `data.table` over-allocates the column list vector to have space for adding additional columns efficiently, see `?truelength`. Somewhere in the previous processing steps an operation (some are mentioned in the warning) has caused the over-allocation to get lost. – Uwe Apr 25 '17 at 16:18
  • Another thing: My dataset is very large and I introduced your code in a loop, appending the result to a file. Later, when I try to read the file, I get this warning: "Warning: stack imbalance in '<-', 2 then 1". What does it mean? – skan May 03 '17 at 11:07
  • @skan Haven't seen that error message before. I think it's better to post a new question including a [mcve]. Thank you. – Uwe May 03 '17 at 18:29

Here is a method using the base R functions split.default and do.call.

# split the non-ID variables into groups based on their name suffix
myList <- split.default(mydata[-1], gsub(".*_(\\d)$", "\\1", names(mydata[-1])))

# append variables by row after setting the regularizing variable names, cbind ID
cbind(mydata[1],
      do.call(rbind, lapply(myList, function(x) setNames(x, gsub("_\\d$", "", names(x))))))
    ID ZA BB
1.1  1  1  3
1.2  2  2  3
1.3  3  3  3
1.4  4  4  3
1.5  5  5  3
2.1  1  5  6
2.2  2  4  6
2.3  3  3  6
2.4  4  2  6
2.5  5  1  6

The first line splits the data.frame variables (minus ID) into a list of groups that agree on the final character of their variable names. This grouping is determined using gsub. The second line uses do.call to call rbind on this list of variables, modified with setNames so that the final digit and underscore are removed from their names. Finally, cbind attaches the ID to the resulting data.frame.

Note that the data has to be structured regularly, with no missing variables, etc.

lmo

An alternative approach with data.table:

melt(mydata, id = 'ID')[, c("variable", "measure") := tstrsplit(variable, '_')
                        ][, variable := factor(variable, levels = unique(variable))
                          ][, dcast(.SD, ID + measure ~ variable, value.var = 'value')]

which gives:

    ID measure ZA BB CC
 1:  1       1  1  3 NA
 2:  1       2  5  6 NA
 3:  1       7 NA NA  6
 4:  2       1  2  3 NA
 5:  2       2  4  6 NA
 6:  2       7 NA NA  5
 7:  3       1  3  3 NA
 8:  3       2  3  6 NA
 9:  3       7 NA NA  4
10:  4       1  4  3 NA
11:  4       2  2  6 NA
12:  4       7 NA NA  3
13:  5       1  5  3 NA
14:  5       2  1  6 NA
15:  5       7 NA NA  2
Jaap

Finally I've found the way, by modifying my initial solution:

mydata <- data.table(ID = 1:5, ZA_2001 = 1:5, ZA_2002 = 5:1,
                     BB_2001 = rep(3, 5), BB_2002 = rep(6, 5), CC_2007 = 6:2)

idvars <- grep("_20[0-9][0-9]$", names(mydata), invert = TRUE)
temp <- melt(mydata, id.vars = idvars)
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable),
            measure = sub('.*_', '', variable), variable = NULL)]
temp[, var := factor(var, levels = unique(var))]
dcast(temp, ... ~ var, value.var = 'value')

And it gives the proper measure values. However, this solution needs a lot of memory.

The trick was converting the var variable to a factor, specifying the order I want with levels, as mtoto did. mtoto's solution is nice because it only needs melt, not melt plus dcast, but it doesn't work with my updated example; it only works when there is the same number of suffix variations for each prefix.

PS: I've been profiling every step and found that the melt step can be a big problem when working with large data.tables. If you have a data.table with just 100000 rows x 1000 columns and use half of the columns as id.vars, the output is approx 50000000 x 500, just too much to continue with the next step. data.table needs a direct way to do it without creating giant intermediate steps.
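
Just to show the arithmetic behind that estimate (a rough sketch, using the numbers from the example above):

# rough size of the molten intermediate table
n_rows <- 1e5; n_cols <- 1000
n_id <- n_cols / 2                  # columns kept as id.vars
n_measure <- n_cols - n_id          # columns being melted
molten_rows <- n_rows * n_measure   # 5e7 rows
molten_cols <- n_id + 2             # id.vars + "variable" + "value"
c(molten_rows, molten_cols)         # ~50,000,000 x ~500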

skan