4

I have a data frame like this:

structure(list(one = structure(1:4, .Label = c("a", "b", "c", 
"d"), class = "factor"), two = c(2, 4, 7, 3), x.1 = c("x1a", 
"x1b", "x1c", "x1d"), x.2 = c("x2a", "x2b", "x2c", "x2d"), x.3 = c("x3a", 
"x3b", "x3c", "x3d"), y.1 = c(NA, "y1b", "y1c", NA), y.2 = c(NA, 
"y2b", "y2c", NA), y.3 = c(NA, "y3b", "y3c", NA)), .Names = c("one", 
"two", "x.1", "x.2", "x.3", "y.1", "y.2", "y.3"), row.names = c(NA, 
-4L), class = "data.frame")

As you can see, the observations per event a, b, c, and d (variable "one") are stored as columns, where x and y define separate observations and 1, 2 and 3 define the variables. Variable "two" does not have a meaning here.

I like to reshape this data frame to have it tidy in the form that each observation has it's own row and each variable it's own column.

The final data frame should look like this:

structure(list(one = structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("a", 
"b", "c", "d"), class = "factor"), two = c(2, 4, 2, 7, 5, 3), 
var1 = c("x1a", "x1b", "y1b", "x1c", "y1c", "x1d"), var2 = c("x2a", 
"x2b", "y2b", "x2c", "y2c", "x2d"), var3 = c("x3a", "x3b", 
"y3b", "x3c", "y3c", "x3d")), .Names = c("one", "two", "var1", 
"var2", "var3"), row.names = c(1L, 2L, 5L, 3L, 6L, 4L), class = "data.frame")

I am slightly familiar with what the cast and melt function from the reshape packages do, but was not able yet to figure out a way to reshape the DF in a smart way. For now the following provides the sate that I have gotten to:

df.between <- melt(df.in, id.vars=c("one", "two"))
df.between$variable <- gsub("x.|y.", "", df.between$variable)

Now the "variable" column does correctly identify the variable (1, 2 or 3). However, I was not able to cast this into the required form and this solution does not seem to be useful for larger sets of data due to the use of grepl.

Happy to get a nudge into the right direction here.

Jaap
  • 81,064
  • 34
  • 182
  • 193
fr3d-5
  • 792
  • 1
  • 6
  • 27

3 Answers3

5

We can use melt from the devel version of data.table i.e. v1.9.5, which can handle multiple patterns for the measure variables.

library(data.table)
melt(setDT(df1), measure=patterns('.1', '.2', '.3'),
      na.rm=TRUE, value.name=paste0('var', 1:3))[, variable:=NULL][order(one)]
#   one two var1 var2 var3
#1:   a   2  x1a  x2a  x3a
#2:   b   4  x1b  x2b  x3b
#3:   b   4  y1b  y2b  y3b
#4:   c   7  x1c  x2c  x3c
#5:   c   7  y1c  y2c  y3c
#6:   d   3  x1d  x2d  x3d

EDIT: We don't need c inside the patterns and it will also give exact matches (from @Jaap's comments).

akrun
  • 874,273
  • 37
  • 540
  • 662
3

melt from "data.table" will be much faster than the following, but you could also consider merged.stack from my "splitstackshape" package:

library(splitstackshape)
na.omit(merged.stack(mydf, var.stubs = c(".1", ".2", ".3"),
                     sep = "var.stubs", atStart = FALSE))

#    one two .time_1  .1  .2  .3
# 1:   a   2       x x1a x2a x3a
# 2:   b   4       x x1b x2b x3b
# 3:   b   4       y y1b y2b y3b
# 4:   c   7       x x1c x2c x3c
# 5:   c   7       y y1c y2c y3c
# 6:   d   3       x x1d x2d x3d
Jaap
  • 81,064
  • 34
  • 182
  • 193
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
2

You were almost there with the reshape-route, so I finished it for you. All you needed was a differentiation between x and y variables. (which are later easy to remove if you don't want or need them). I've left the missings in because they are easy to remove and to prevent silent removal of missing data.

df.between <- melt(df.in, id.vars=c("one", "two"))
#replace with 'var' so no numeric column names.
df.between$variable_n <- gsub("x.|y.", "var", df.between$variable)
df.between$variable_xy <- gsub(".[0-9]","",df.between$variable)

res <- dcast(one+two+variable_xy~variable_n,value.var="value",data=df.between)

    > res
  one two variable_xy var1 var2 var3
1   a   2           x  x1a  x2a  x3a
2   a   2           y <NA> <NA> <NA>
3   b   4           x  x1b  x2b  x3b
4   b   4           y  y1b  y2b  y3b
5   c   7           x  x1c  x2c  x3c
6   c   7           y  y1c  y2c  y3c
7   d   3           x  x1d  x2d  x3d
8   d   3           y <NA> <NA> <NA>
Heroka
  • 12,889
  • 1
  • 28
  • 38