0

I have some data read in from an excel spreadsheet where the curators are not aware of relational databases and handling 1-many relationships, so have put multiple variables in one column:

>df <- data.frame(id=c("X1", "X23", "X5"), vars=c("foo, bar, hello", "world", NA), var2=c(1,2,3))
>df
   id            vars var2
1  X1 foo, bar, hello    1
2 X23           world    2
3  X5            <NA>    3

I want to transform the vars column to a new data frame so I can have a 1-many relation:

>df
     id var2
X1   X1    1
X23 X23    2
X5   X5    3

>df2
   id   var
1  X1   foo
2  X1   bar
3  X1 hello
4 X23 world

I am able to parse the vars column into a list where each entry is a vector of variables:

>library(stringr)
>halfway <- str_split(df$vars, pattern=", ")
>halfway
[[1]]
[1] "foo"   "bar"   "hello"

[[2]]
[1] "world"

[[3]]
[1] NA

but I'm unsure how to take this list and transform it to a long data.frame.

I've had a play around, but I can't get it into the long format without losing information about the IDs each of the variables belongs to (using unlist). I've also looked at reshape but it doesn't seem to do what I want.

I could use a for loop to iteratively build up the new table, but that's horribly inefficient. Is there an elegant solution for this?

Scott Ritchie
  • 10,293
  • 3
  • 28
  • 64

3 Answers3

2

This could be done in a very straightforward way with the data.table package:

library(data.table)
dt = as.data.table(df)
df2 = dt[, list(var=str_split(vars, ", ")[[1]]), by=id]
df2 = df2[!is.na(var), ]

One advantage of this is that if you have multiple ID columns (say, id, id2, id3), you can just change it to

df2 = dt[, list(var=strsplit(vars, ", ")[[1]]), by=c("id", "id2", "id3")]
David Robinson
  • 77,383
  • 16
  • 167
  • 187
2

concat.split.multiple from my "splitstackshape" package has an option to perform the splitting and do the reshaping in one step, making the only thing left removing the rows with NA values:

library(splitstackshape)
out <- concat.split.multiple(df, "vars", ",", direction = "long")
out[complete.cases(out), ]
#    id var2 time  vars
# 1  X1    1    1   foo
# 2 X23    2    1 world
# 4  X1    1    2   bar
# 7  X1    1    3 hello

In other scenarios the "guts" of the function has fared quite well with respect to speed. I've never benchmarked this particular functionality though (mostly because I never really deal with very large datasets to begin with).

Community
  • 1
  • 1
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
1

The expand.grid function is often useful for reshaping data. For example:

> expand.grid(df[1,1],halfway[[1]])
  Var1  Var2
1   X1   foo
2   X1   bar
3   X1 hello

You can use apply to do this to each row of your data frame:

threequarterway <- lapply(seq(nrow(df)),function(i) expand.grid(df[i,1],halfway[[i]]))

and do.call to bind the resulting list elements into a single data frame:

df2 <- do.call(rbind,threequarterway)

Finally, get rid of the NA row as in David Robinson's answer:

df2 = df2[!is.na(df2[,2]),]

(David's answer appeared while I was typing this, and is probably a better method, but I thought you might like to know about expand.grid anyway.)

Alexander Hanysz
  • 791
  • 5
  • 15