How to turn multiple columns into observations

Question

I have a data frame like this:

structure(list(one = structure(1:4, .Label = c("a", "b", "c", 
"d"), class = "factor"), two = c(2, 4, 7, 3), x.1 = c("x1a", 
"x1b", "x1c", "x1d"), x.2 = c("x2a", "x2b", "x2c", "x2d"), x.3 = c("x3a", 
"x3b", "x3c", "x3d"), y.1 = c(NA, "y1b", "y1c", NA), y.2 = c(NA, 
"y2b", "y2c", NA), y.3 = c(NA, "y3b", "y3c", NA)), .Names = c("one", 
"two", "x.1", "x.2", "x.3", "y.1", "y.2", "y.3"), row.names = c(NA, 
-4L), class = "data.frame")

As you can see, the observations per event a, b, c, and d (variable "one") are stored as columns, where x and y define separate observations and 1, 2 and 3 define the variables. Variable "two" does not have a meaning here.

I like to reshape this data frame to have it tidy in the form that each observation has it's own row and each variable it's own column.

The final data frame should look like this:

structure(list(one = structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("a", 
"b", "c", "d"), class = "factor"), two = c(2, 4, 2, 7, 5, 3), 
var1 = c("x1a", "x1b", "y1b", "x1c", "y1c", "x1d"), var2 = c("x2a", 
"x2b", "y2b", "x2c", "y2c", "x2d"), var3 = c("x3a", "x3b", 
"y3b", "x3c", "y3c", "x3d")), .Names = c("one", "two", "var1", 
"var2", "var3"), row.names = c(1L, 2L, 5L, 3L, 6L, 4L), class = "data.frame")

I am slightly familiar with what the cast and melt function from the reshape packages do, but was not able yet to figure out a way to reshape the DF in a smart way. For now the following provides the sate that I have gotten to:

df.between <- melt(df.in, id.vars=c("one", "two"))
df.between$variable <- gsub("x.|y.", "", df.between$variable)

Now the "variable" column does correctly identify the variable (1, 2 or 3). However, I was not able to cast this into the required form and this solution does not seem to be useful for larger sets of data due to the use of grepl.

Happy to get a nudge into the right direction here.

I guess you have some numbers in the `two` column wrong in the expected output. — akrun, Sep 18 '15 at 15:21

akrun · Answer 1 · 2015-09-18T15:29:11.443

5

We can use melt from the devel version of data.table i.e. v1.9.5, which can handle multiple patterns for the measure variables.

library(data.table)
melt(setDT(df1), measure=patterns('.1', '.2', '.3'),
      na.rm=TRUE, value.name=paste0('var', 1:3))[, variable:=NULL][order(one)]
#   one two var1 var2 var3
#1:   a   2  x1a  x2a  x3a
#2:   b   4  x1b  x2b  x3b
#3:   b   4  y1b  y2b  y3b
#4:   c   7  x1c  x2c  x3c
#5:   c   7  y1c  y2c  y3c
#6:   d   3  x1d  x2d  x3d

EDIT: We don't need c inside the patterns and it will also give exact matches (from @Jaap's comments).

edited Sep 18 '15 at 15:29

answered Sep 18 '15 at 15:15

akrun

874,273
37
540
662

1

`patterns(".1", ".2", ".3")` works as well – Jaap Sep 18 '15 at 15:26
@Jaap Yes, but wouldn't `.` means any character? – akrun Sep 18 '15 at 15:27
1

Apperently `patterns` looks for exact matches – Jaap Sep 18 '15 at 15:29
@Jaap The examples in the `?melt` suggest you are right. I didn't find the wording in the description. – akrun Sep 18 '15 at 15:32

score 3 · Answer 2 · edited Sep 18 '15 at 16:03

melt from "data.table" will be much faster than the following, but you could also consider merged.stack from my "splitstackshape" package:

library(splitstackshape)
na.omit(merged.stack(mydf, var.stubs = c(".1", ".2", ".3"),
                     sep = "var.stubs", atStart = FALSE))

#    one two .time_1  .1  .2  .3
# 1:   a   2       x x1a x2a x3a
# 2:   b   4       x x1b x2b x3b
# 3:   b   4       y y1b y2b y3b
# 4:   c   7       x x1c x2c x3c
# 5:   c   7       y y1c y2c y3c
# 6:   d   3       x x1d x2d x3d

Heroka · Answer 3 · 2015-09-18T15:32:00.877

You were almost there with the reshape-route, so I finished it for you. All you needed was a differentiation between x and y variables. (which are later easy to remove if you don't want or need them). I've left the missings in because they are easy to remove and to prevent silent removal of missing data.

df.between <- melt(df.in, id.vars=c("one", "two"))
#replace with 'var' so no numeric column names.
df.between$variable_n <- gsub("x.|y.", "var", df.between$variable)
df.between$variable_xy <- gsub(".[0-9]","",df.between$variable)

res <- dcast(one+two+variable_xy~variable_n,value.var="value",data=df.between)

    > res
  one two variable_xy var1 var2 var3
1   a   2           x  x1a  x2a  x3a
2   a   2           y <NA> <NA> <NA>
3   b   4           x  x1b  x2b  x3b
4   b   4           y  y1b  y2b  y3b
5   c   7           x  x1c  x2c  x3c
6   c   7           y  y1c  y2c  y3c
7   d   3           x  x1d  x2d  x3d
8   d   3           y <NA> <NA> <NA>

How to turn multiple columns into observations

3 Answers3