43

I need to reshape my wide table into long format but keeping multiple fields for each record, for example:

dw <- read.table(header=T, text='
 sbj f1.avg f1.sd f2.avg f2.sd  blabla
   A   10    6     50     10      bA
   B   12    5     70     11      bB
   C   20    7     20     8       bC
   D   22    8     22     9       bD
 ')

# Now I want to melt this table, keeping both AVG and SD as separate fields for each measurement, to get something like this:

 #    sbj var avg  sd  blabla
 #     A   f1  10  6     bA
 #     A   f2  50  10    bA
 #     B   f1  12  5     bB
 #     B   f2  70  11    bB
 #     C   f1  20  7     bC
 #     C   f2  20  8     bC
 #     D   f1  22  8     bD
 #     D   f2  22  9     bD

I have basic knowledge of using melt and reshape, but it is not obvious for me how to apply such reshaping in my case.

Cœur
  • 37,241
  • 25
  • 195
  • 267
Vasily A
  • 8,256
  • 10
  • 42
  • 76
  • 4
    This should be a fairly standard `reshape` from "wide" to "long" - see here: http://stackoverflow.com/questions/14638747/how-do-i-stack-only-some-columns-in-a-data-frame – thelatemail May 30 '14 at 01:09

5 Answers5

36

reshape does this with the appropriate arguments.

varying lists the columns which exist in the wide format, but are split into multiple rows in the long format. v.names is the long format equivalents. Between the two, a mapping is created.

From ?reshape:

Also, guessing is not attempted if v.names is given explicitly. Notice that the order of variables in varying is like x.1,y.1,x.2,y.2.

Given these varying and v.names arguments, reshape is smart enough to see that I've specified that the index is before the dot here (i.e., order 1.x, 1.y, 2.x, 2.y). Note that the original data has the columns in this order, so we can specify varying=2:5 for this example data, but that is not safe in general.

Given the values of times and v.names, reshape splits the varying columns on a . character (the default sep argument) to create the columns in the output.

times specifies values that are to be used in the created var column, and v.names are pasted onto these values to get column names in the wide format for mapping to the result.

Finally, idvar is specified to be the sbj column, which identifies individual records in the wide format (thanks @thelatemail).

reshape(dw, direction='long', 
        varying=c('f1.avg', 'f1.sd', 'f2.avg', 'f2.sd'), 
        timevar='var',
        times=c('f1', 'f2'),
        v.names=c('avg', 'sd'),
        idvar='sbj')

##      sbj blabla var avg sd
## A.f1   A     bA  f1  10  6
## B.f1   B     bB  f1  12  5
## C.f1   C     bC  f1  20  7
## D.f1   D     bD  f1  22  8
## A.f2   A     bA  f2  50 10
## B.f2   B     bB  f2  70 11
## C.f2   C     bC  f2  20  8
## D.f2   D     bD  f2  22  9
Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
  • 1
    Adding `idvar='sbj'` will avoid the adding of the `id` column to the result too. – thelatemail May 30 '14 at 01:36
  • 1
    surprisingly (at least to me), the order of the columns matters. If you reorder the columns dw = dw[,c(1,4,3,2,5,6)] and then run this answer, the result is incorrect. – Brian D Jan 24 '17 at 23:14
  • @BrianD So it does, thanks! I was tacitly using an assumption of the data (even without realizing it) and will fix the answer shortly. – Matthew Lundberg Jan 24 '17 at 23:30
  • 5
    To make this answer more robust at handling many pairs of "avg" and "sd" columns replace the `varying` argument with `varying=list(grep("avg", colnames(dw), value=T), grep("sd", colnames(dw), value=T))` – Robin Sep 20 '18 at 21:52
27

Another option using Hadley's new tidyr package.

library(tidyr)
library(dplyr)

dw <- read.table(header=T, text='
 sbj f1.avg f1.sd f2.avg f2.sd  blabla
   A   10    6     50     10      bA
   B   12    5     70     11      bB
   C   20    7     20     8       bC
   D   22    8     22     9       bD
 ')

dw %>% 
  gather(v, value, f1.avg:f2.sd) %>% 
  separate(v, c("var", "col")) %>% 
  arrange(sbj) %>% 
  spread(col, value)
hadley
  • 102,019
  • 32
  • 183
  • 245
Maiasaura
  • 32,226
  • 27
  • 104
  • 108
  • 4
    Helpful info about tidyr and how the above code works here: https://blog.rstudio.org/2014/07/22/introducing-tidyr/ – Brian D Jan 25 '17 at 14:23
15

melt from the >=1.9.6 version of data.table, does this by specifying the column index in measure.vars as a list.

 melt(setDT(dw), measure.vars=list(c(2,4), c(3,5)), 
     variable.name='var', value.name=c('avg', 'sd'))[, 
      var:= paste0('f',var)][order(sbj)]
#   sbj blabla var avg sd
#1:   A     bA  f1  10  6
#2:   A     bA  f2  50 10
#3:   B     bB  f1  12  5
#4:   B     bB  f2  70 11
#5:   C     bC  f1  20  7
#6:   C     bC  f2  20  8
#7:   D     bD  f1  22  8
#8:   D     bD  f2  22  9

Or you could use the new patterns function:

melt(setDT(dw), 
     measure = patterns("avg", "sd"),
     variable.name = 'var', value.name = c('avg', 'sd'))
#    sbj blabla var avg sd
# 1:   A     bA   1  10  6
# 2:   B     bB   1  12  5
# 3:   C     bC   1  20  7
# 4:   D     bD   1  22  8
# 5:   A     bA   2  50 10
# 6:   B     bB   2  70 11
# 7:   C     bC   2  20  8
# 8:   D     bD   2  22  9
Jaap
  • 81,064
  • 34
  • 182
  • 193
akrun
  • 874,273
  • 37
  • 540
  • 662
9

This seems to do what you want except that the f is removed from elements in time.

reshape(dw, idvar = "sbj", varying = list(c(2,4),c(3,5)), v.names = c("ave", "sd"), direction = "long")

    sbj blabla time ave sd
A.1   A     bA    1  10  6
B.1   B     bB    1  12  5
C.1   C     bC    1  20  7
D.1   D     bD    1  22  8
A.2   A     bA    2  50 10
B.2   B     bB    2  70 11
C.2   C     bC    2  20  8
D.2   D     bD    2  22  9
Mark Miller
  • 12,483
  • 23
  • 78
  • 132
  • aha, you do almost the same call as in previous answer except for using `idvar` instead of `timevar` and `times` - I will sort out the difference... Thanks! – Vasily A May 30 '14 at 01:25
  • 2
    @VasilyA The structure of `varying` is also very different between the answers. Indeed, the only things they seem to have in common is they both use the same `v.names` and both use `direction="long"`. – Mark Miller May 30 '14 at 01:27
  • indeed, now I see... thanks for pointing out! – Vasily A May 30 '14 at 01:30
  • 2
    The interesting thing in comparing the two correct answers is that this one uses the structure of the `list` argument to 'varying', while the Lundberg answer uses 'times' and 'v.names' to convey some of the structure. I've never gotten those options straightened out in my head and usually use trial and error. – IRTFM May 30 '14 at 01:41
  • To clarify, the code for this answer does not remove the `f` from the column names to create the elements. Without, the `times` argument the column names aren't even considered. When creating the dataframe `dw`, I can replace the "f1" with "f1.alpha" and "f2" with "f2.beta", run the code in this answer, and the "time" column will be the same as in this answer. It won't be "1.alpha" or "2.beta". If someone wants to keep the column names as elements, the column names need to be specified in the `times` argument (in the same order as in the `varying` argument). – Robin Sep 20 '18 at 21:44
5

To add to the options available here, you can also consider merged.stack from my "splitstackshape" package:

library(splitstackshape)
merged.stack(dw, var.stubs = c("avg", "sd"), sep = "var.stubs", atStart = FALSE)
#    sbj blabla .time_1 avg sd
# 1:   A     bA     f1.  10  6
# 2:   A     bA     f2.  50 10
# 3:   B     bB     f1.  12  5
# 4:   B     bB     f2.  70 11
# 5:   C     bC     f1.  20  7
# 6:   C     bC     f2.  20  8
# 7:   D     bD     f1.  22  8
# 8:   D     bD     f2.  22  9

You can also do a little more cleanup on the ".time_1" variable, like this.

merged.stack(dw, var.stubs = c("avg", "sd"), 
             sep = "var.stubs", atStart = FALSE)[, .time_1 := sub(
               ".", "", .time_1, fixed = TRUE)][]
#    sbj blabla .time_1 avg sd
# 1:   A     bA      f1  10  6
# 2:   A     bA      f2  50 10
# 3:   B     bB      f1  12  5
# 4:   B     bB      f2  70 11
# 5:   C     bC      f1  20  7
# 6:   C     bC      f2  20  8
# 7:   D     bD      f1  22  8
# 8:   D     bD      f2  22  9

You would note the use of the atStart = FALSE argument. This is because your names are in a little bit of a different order than reshape-related functions seem to like. In general, the "stub" is expected to come first, and then the "times", like this:

dw2 <- dw
setnames(dw2, gsub("(.*)\\.(.*)", "\\2.\\1", names(dw2)))
names(dw2)
# [1] "sbj"    "avg.f1" "sd.f1"  "avg.f2" "sd.f2"  "blabla"

If the names were in that format, then both base R's reshape and merged.stack benefit from more direct syntax:

merged.stack(dw2, var.stubs = c("avg", "sd"), sep = ".")
reshape(dw2, idvar = c("sbj", "blabla"), varying = 2:5, 
        sep = ".", direction = "long")
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485