reshaping a data frame into long format in R

Question

I'm struggling with a reshape in R. I have 2 types of error (err and rel_err) that have been calculated for 3 different models. This gives me a total of 6 error variables (i.e. err_1, err_2, err_3, rel_err_1, rel_err_2, and rel_err_3). For each of these types of error I have 3 different types of predictive validity tests (i.e. random holdouts, backcast, forecast). I would like to make my data set long so that I keep the 3 types of test long while also making the two error measurements long. So in the end I will have one variable called err and one called rel_err, as well as an id variable for which model the error corresponds to (1, 2, or 3).

Here is my data right now:

iter       err_1  rel_err_1      err_2  rel_err_2      err_3  rel_err_3 test_type
1 -0.09385732 -0.2235443 -0.1216982 -0.2898543 -0.1058366 -0.2520759    random
1  0.16141630  0.8575728  0.1418732  0.7537442  0.1584816  0.8419816    back
1  0.16376930  0.8700738  0.1431505  0.7605302  0.1596502  0.8481901    front
1  0.14345986  0.6765194  0.1213689  0.5723444  0.1374676  0.6482615    random
1  0.15890059  0.7435382  0.1589823  0.7439204  0.1608709  0.7527580    back
1  0.14412360  0.6743928  0.1442039  0.6747684  0.1463520  0.6848202    front

and here is what I would like it to look like:

iter     model    err           rel_err    test_type
1        1        -0.09385732    (#'s)     random
1        2        -0.1216982     (#'s)     random
1        3        -0.1058366     (#'s)     random

and on...

I've tried playing around with the syntax but can't quite figure out what to put for the time.varying argument

Thanks very much for any help you can offer.

I'd check out the reshape2 package first for this sort of thing. If you really want to learn base `reshape` check out two of my blog posts on it: [(LINK)](http://trinkerrstuff.wordpress.com/category/reshape/) — Tyler Rinker, Oct 25 '12 at 05:16
I asked a very similar question not long ago and got some pretty good answers: [link](http://stackoverflow.com/questions/12837609/reshaping-a-data-frame-with-more-than-one-measure-variable) — eli-k, Oct 25 '12 at 21:41

John · Answer 1 · 2012-10-25T22:49:35.470

5

You could do it the "hard" way. For transparency you can use names.

with( dat, data.frame(iter = rep(iter, 3), 
      model = rep(1:3, each = nrow(dat)),
      err = c(err_1, err_2, err_3), 
      rel_err = c(rel_err_1, rel_err_2, rel_err_3), 
      test_type = rep(test_type, 3)) )

Or, for conciseness, indexes.

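# unlist() stacks the three columns into one vector; rep() repeats the id columns to match
data.frame(iter = rep(dat[,1], 3), model = rep(1:3, each = nrow(dat)),
          err = unlist(dat[,c(2, 4, 6)], use.names = FALSE),
          rel_err = unlist(dat[,c(3, 5, 7)], use.names = FALSE), test_type = rep(dat[,8], 3))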

If you had a LOT of columns the hard way might involve grepping the column names.
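As a rough sketch of what grepping the column names could look like: the `dat` below is a small stand-in built from the first three rows of the question's data, and the regular expressions assume the columns follow the `err_<n>` / `rel_err_<n>` naming exactly. It also assumes both sets of columns appear in the same model order.

```r
# Stand-in for the question's wide data (first three rows only; an assumption).
dat <- data.frame(
  iter = 1,
  err_1 = c(-0.09385732, 0.16141630, 0.16376930),
  rel_err_1 = c(-0.2235443, 0.8575728, 0.8700738),
  err_2 = c(-0.1216982, 0.1418732, 0.1431505),
  rel_err_2 = c(-0.2898543, 0.7537442, 0.7605302),
  err_3 = c(-0.1058366, 0.1584816, 0.1596502),
  rel_err_3 = c(-0.2520759, 0.8419816, 0.8481901),
  test_type = c("random", "back", "front")
)

# Anchored patterns keep "err_1" from also matching "rel_err_1".
err_cols <- grep("^err_[0-9]+$", names(dat), value = TRUE)
rel_cols <- grep("^rel_err_[0-9]+$", names(dat), value = TRUE)

# Model ids are read from the numeric suffix of the err_* column names.
models <- as.integer(sub("^err_", "", err_cols))

long <- data.frame(
  iter = rep(dat$iter, length(err_cols)),
  model = rep(models, each = nrow(dat)),
  err = unlist(dat[err_cols], use.names = FALSE),
  rel_err = unlist(dat[rel_cols], use.names = FALSE),
  test_type = rep(dat$test_type, length(err_cols))
)
```

With this version, adding a fourth model to the wide data would not require touching the code, as long as the new columns follow the same naming pattern.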

This "hard" way was about as concise as reshape and required less thinking about how to use the commands. Sometimes I just skip thinking about reshape.

edited Oct 25 '12 at 22:49

answered Oct 25 '12 at 11:24

this approach looks so much simpler than reshape (reshape2 tools too) - what's the catch? – eli-k Oct 25 '12 at 21:45
If you have your columns named well in your wide data, then reshape could be more concise. Also, the complexity of this code scales linearly with the complexity of the data. That's not necessarily the case for reshape(2) syntax. Finally, if you're moving back and forth the new data.frame has properties that let `reshape` know how to put it back. So, you could just `reshape(longDat)` with no extra arguments to return it. – John Oct 25 '12 at 22:49

mnel · Answer 2 · 2012-10-25T22:57:31.457

The base function `reshape` will let you do this:

reshape(DT, direction = 'long',
        varying = list(paste('err', 1:3, sep = '_'), paste('rel_err', 1:3, sep = '_')),
        v.names = c('err', 'rel_err'), timevar = 'model')
    iter test_type model         err    rel_err id
1.1    1    random     1 -0.09385732 -0.2235443  1
2.1    1      back     1  0.16141630  0.8575728  2
3.1    1     front     1  0.16376930  0.8700738  3
4.1    1    random     1  0.14345986  0.6765194  4
5.1    1      back     1  0.15890059  0.7435382  5
6.1    1     front     1  0.14412360  0.6743928  6
1.2    1    random     2 -0.12169820 -0.2898543  1
2.2    1      back     2  0.14187320  0.7537442  2
3.2    1     front     2  0.14315050  0.7605302  3
4.2    1    random     2  0.12136890  0.5723444  4
5.2    1      back     2  0.15898230  0.7439204  5
6.2    1     front     2  0.14420390  0.6747684  6
1.3    1    random     3 -0.10583660 -0.2520759  1
2.3    1      back     3  0.15848160  0.8419816  2
3.3    1     front     3  0.15965020  0.8481901  3
4.3    1    random     3  0.13746760  0.6482615  4
5.3    1      back     3  0.16087090  0.7527580  5
6.3    1     front     3  0.14635200  0.6848202  6

I agree that the syntax for reshape is hard to get your head around sometimes. I will spell out how this call works:

direction = 'long' -- reshapes to long format.
varying = list(paste('err',1:3,sep ='_'), paste('rel_err',1:3,sep ='_')) -- We pass a list of length 2 because we are stacking into two different variables. The columns paste('err',1:3,sep ='_') will become the first new variable in long format, and the columns paste('rel_err',1:3,sep ='_') will become the second new variable in long format.
v.names = c('err','rel_err') -- sets the names of the two new variables in long format.
timevar = 'model' -- sets the name of the time identifier (here the _1, _2, _3 suffixes of the columns in wide format).

I hope this is somewhat clearer.
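The comment under the first answer mentions that a data frame produced by `reshape` can be turned back to wide format with no extra arguments. A minimal sketch of that round trip, assuming a two-row stand-in for `DT` with the same column layout as the question's data (the values are taken from its first two rows):

```r
# Two-row stand-in for the wide data (an assumption, not the full data set).
DT <- data.frame(
  iter = 1,
  err_1 = c(-0.09385732, 0.16141630),
  rel_err_1 = c(-0.2235443, 0.8575728),
  err_2 = c(-0.1216982, 0.1418732),
  rel_err_2 = c(-0.2898543, 0.7537442),
  err_3 = c(-0.1058366, 0.1584816),
  rel_err_3 = c(-0.2520759, 0.8419816),
  test_type = c("random", "back")
)

# Wide to long, using the same call as in the answer.
long <- reshape(DT, direction = "long",
                varying = list(paste("err", 1:3, sep = "_"),
                               paste("rel_err", 1:3, sep = "_")),
                v.names = c("err", "rel_err"), timevar = "model")

# Long back to wide: reshape() reuses the settings it stored on `long`.
wide <- reshape(long)
```

This only works on the object returned by `reshape` itself; a long data frame assembled by hand with `data.frame` does not carry the stored settings, so it would need the full set of arguments to go back to wide.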
