Elegant way to parse a 1-many column into a new data.frame in R

Question

I have some data read in from an excel spreadsheet where the curators are not aware of relational databases and handling 1-many relationships, so have put multiple variables in one column:

>df <- data.frame(id=c("X1", "X23", "X5"), vars=c("foo, bar, hello", "world", NA), var2=c(1,2,3))
>df
   id            vars var2
1  X1 foo, bar, hello    1
2 X23           world    2
3  X5            <NA>    3

I want to transform the vars column to a new data frame so I can have a 1-many relation:

>df
     id var2
X1   X1    1
X23 X23    2
X5   X5    3

>df2
   id   var
1  X1   foo
2  X1   bar
3  X1 hello
4 X23 world

I am able to parse the vars column into a list where each entry is a vector of variables:

>library(stringr)
>halfway <- str_split(df$vars, pattern=", ")
>halfway
[[1]]
[1] "foo"   "bar"   "hello"

[[2]]
[1] "world"

[[3]]
[1] NA

but I'm unsure how to take this list and transform it to a long data.frame.

I've had a play around, but I can't get it into the long format without losing information about the IDs each of the variables belongs to (using unlist). I've also looked at reshape but it doesn't seem to do what I want.

I could use a for loop to iteratively build up the new table, but that's horribly inefficient. Is there an elegant solution for this?

David Robinson · Accepted Answer · 2013-08-01T01:59:57.157

2

This could be done in a very straightforward way with the data.table package:

library(data.table)
dt = as.data.table(df)
df2 = dt[, list(var=str_split(vars, ", ")[[1]]), by=id]
df2 = df2[!is.na(var), ]

One advantage of this is that if you have multiple ID columns (say, id, id2, id3), you can just change it to

df2 = dt[, list(var=strsplit(vars, ", ")[[1]]), by=c("id", "id2", "id3")]

edited Aug 01 '13 at 01:59

answered Aug 01 '13 at 01:54

David Robinson

77,383
16
167
187

Thanks, Ill have to check out the data.table package more! – Scott Ritchie Aug 01 '13 at 03:10

score 2 · Answer 2 · edited May 23 '17 at 12:05

concat.split.multiple from my "splitstackshape" package has an option to perform the splitting and do the reshaping in one step, making the only thing left removing the rows with NA values:

library(splitstackshape)
out <- concat.split.multiple(df, "vars", ",", direction = "long")
out[complete.cases(out), ]
#    id var2 time  vars
# 1  X1    1    1   foo
# 2 X23    2    1 world
# 4  X1    1    2   bar
# 7  X1    1    3 hello

In other scenarios the "guts" of the function has fared quite well with respect to speed. I've never benchmarked this particular functionality though (mostly because I never really deal with very large datasets to begin with).

score 1 · Answer 3 · answered Aug 01 '13 at 02:23

The expand.grid function is often useful for reshaping data. For example:

> expand.grid(df[1,1],halfway[[1]])
  Var1  Var2
1   X1   foo
2   X1   bar
3   X1 hello

You can use apply to do this to each row of your data frame:

threequarterway <- lapply(seq(nrow(df)),function(i) expand.grid(df[i,1],halfway[[i]]))

and do.call to bind the resulting list elements into a single data frame:

df2 <- do.call(rbind,threequarterway)

Finally, get rid of the NA row as in David Robinson's answer:

df2 = df2[!is.na(df2[,2]),]

(David's answer appeared while I was typing this, and is probably a better method, but I thought you might like to know about expand.grid anyway.)

Elegant way to parse a 1-many column into a new data.frame in R

3 Answers3