How to strsplit data frame column and replicate rows accordingly?

Question

I have a data frame like this:

> df <- data.frame(Column1=c("id1", "id2", "id3"), Column2=c("text1,text2,text3", "text4", "text5,text6"), Column3=c("text7", "text8,text9,text10,text11", "text12,text13"))

> df
  Column1           Column2                   Column3
1     id1 text1,text2,text3                     text7
2     id2             text4 text8,text9,text10,text11
3     id3       text5,text6             text12,text13

How do I transform it into this format?

  Column1 variable                     value
1     id1  Column2                     text1
2     id1  Column2                     text2
3     id1  Column2                     text3
4     id2  Column2                     text4
5     id3  Column2                     text5
6     id3  Column2                     text6
7     id1  Column3                     text7
8     id2  Column3                     text8
9     id2  Column3                     text9
10    id2  Column3                    text10
11    id2  Column3                    text11
12    id3  Column3                    text12
13    id3  Column3                    text13

I guess the first step is to melt() the data frame (btw, should I worry about that warning?):

> library(reshape2)    
> mdf <- melt(df, id.vars="Column1", measure.vars=c("Column2", "Column3"))
> mdf
  Column1 variable                     value
1     id1  Column2         text1,text2,text3
2     id2  Column2                     text4
3     id3  Column2               text5,text6
4     id1  Column3                     text7
5     id2  Column3 text8,text9,text10,text11
6     id3  Column3             text12,text13
Warning message:
attributes are not identical across measure variables; they will be dropped

Then I would basically need to `strsplit()` the 'value' column and replicate the rows accordingly, but I can't think of a way to do it.

> strsplit(mdf$value, ",")
[[1]]
[1] "text1" "text2" "text3"

[[2]]
[1] "text4"

[[3]]
[1] "text5" "text6"

[[4]]
[1] "text7"

[[5]]
[1] "text8"  "text9"  "text10" "text11"

[[6]]
[1] "text12" "text13"

Any help is appreciated! Thanks.

Jaap · Answer 1 · 2016-11-26T14:56:29.490

A data.table solution:

library(data.table)
mdt <- melt(setDT(df), id.vars="Column1")[,strsplit(as.character(value),",",fixed=TRUE),
                                          by=list(Column1,variable)]

the result:

> mdt
    Column1 variable     V1
 1:     id1  Column2  text1
 2:     id1  Column2  text2
 3:     id1  Column2  text3
....

You can also use the `tstrsplit` function (available in data.table v1.9.5+), which keeps the name of the value column instead of renaming it to V1:

mdt <- melt(setDT(df), id.vars="Column1")[,lapply(.SD, function(x) tstrsplit(x, ",", fixed=TRUE)),
                                          by=list(Column1,variable)]

the result:

> mdt
    Column1 variable  value
 1:     id1  Column2  text1
 2:     id1  Column2  text2
 3:     id1  Column2  text3
....
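
The `V1` name in the first variant can also be avoided by naming the column directly inside `j`. This is a minimal sketch, not part of the original answer: it assumes a reasonably recent data.table (the `.()` alias for `list()`) and builds the example data as a data.table with character columns, so no `as.character()` call is needed.

```r
library(data.table)

# Example data from the question, created directly as a data.table
dt <- data.table(
  Column1 = c("id1", "id2", "id3"),
  Column2 = c("text1,text2,text3", "text4", "text5,text6"),
  Column3 = c("text7", "text8,text9,text10,text11", "text12,text13")
)

# Long format: one row per (Column1, variable) pair
long <- melt(dt, id.vars = "Column1")

# Split each value on commas; naming the result in j yields a "value" column
mdt <- long[, .(value = unlist(strsplit(value, ",", fixed = TRUE))),
            by = .(Column1, variable)]
```

Grouping by `Column1` and `variable` works here because each pair occurs only once in the melted data.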

An alternative solution with dplyr & tidyr:

library(dplyr)
library(tidyr)
mdf <- df %>% gather(variable, value, -Column1) %>% 
  transform(value = strsplit(as.character(value),",")) %>%
  unnest(value)

the result:

> mdf
   Column1 variable  value
1      id1  Column2  text1
2      id1  Column2  text2
3      id1  Column2  text3
....

With recent versions of tidyr, you can also use the `separate_rows` function:

mdf <- df %>% 
  gather(variable, value, -Column1) %>% 
  separate_rows(value)
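
In current tidyr releases `gather()` is superseded by `pivot_longer()`, so the same idea might be written as follows. This is a sketch assuming tidyr 1.0.0 or later; the explicit `sep = ","` and the final `arrange()` are additions here, the latter because `pivot_longer()` returns rows id by id rather than column by column.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  Column1 = c("id1", "id2", "id3"),
  Column2 = c("text1,text2,text3", "text4", "text5,text6"),
  Column3 = c("text7", "text8,text9,text10,text11", "text12,text13"),
  stringsAsFactors = FALSE
)

mdf <- df %>%
  # Long format: the former column names go to "variable", the cells to "value"
  pivot_longer(cols = -Column1, names_to = "variable", values_to = "value") %>%
  # One row per comma-separated element
  separate_rows(value, sep = ",") %>%
  # Put the Column2 rows ahead of the Column3 rows
  arrange(variable)
```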

The "data.table" approach is pretty much what `cSplit` does, but with a few other options (like letting the split data be wide format). — A5C1D2H2I1M1N2O1R2T1, Jul 06 '14 at 13:54
data.table and dplyr are often overlapping in function. Is there a way to achieve the same result with dplyr? — enricoferrero, Jul 06 '14 at 20:40
@enrico16 I've updated my answer by including a `dplyr` solution and I've also updated the `data.table` solution. — Jaap, Aug 03 '15 at 10:06

score 4 · Accepted Answer · edited Jul 07 '14 at 10:50

You could try:

 library(reshape2)

cSplit from https://gist.github.com/mrdwab/11380733

 cSplit(melt(df, id.vars="Column1"), "value", ",", "long")
 #      Column1 variable  value
 # 1:     id1  Column2  text1
 # 2:     id1  Column2  text2
 # 3:     id1  Column2  text3
 # 4:     id2  Column2  text4
 # 5:     id3  Column2  text5
 # 6:     id3  Column2  text6
 # 7:     id1  Column3  text7
 # 8:     id2  Column3  text8
 # 9:     id2  Column3  text9
 #10:     id2  Column3 text10
 #11:     id2  Column3 text11
 #12:     id3  Column3 text12
 #13:     id3  Column3 text13

Alternatively, if one wants to stick to functions available in CRAN packages:

library(reshape2)
library(splitstackshape)
library(dplyr)
select(na.omit(concat.split.multiple(melt(df, id.vars="Column1"), split.col="value", sep=",", direction="long")), -time)

edited Jul 07 '14 at 10:50 by enricoferrero · answered Jul 06 '14 at 11:56 by akrun

This answer rocks!!! :-) But, I would have probably done `cSplit(melt(df, id.vars="Column1"), "value", ",", "long")` instead. – A5C1D2H2I1M1N2O1R2T1 Jul 06 '14 at 13:52
@Ananda Mahto, Thanks. I updated the code with the compact version you suggested. When I run the code for the first time, it gave a Warning message; Warning message: attributes are not identical across measure variables; they will be dropped. Now, I am not getting the message. – akrun Jul 06 '14 at 14:03
@AnandaMahto, is cSplit() part of a package? I found splitstackshape::concat.split.multiple() that works similarly but annoyingly inserts a 'time' column when using direction='long'. – enricoferrero Jul 06 '14 at 15:48
@Enrico, not yet. It will replace `concat.split.multiple` once I get some time to figure out what needs to be pruned from 'splitstackshape'. The time variable you mention is a side effect of the `reshape` function. – A5C1D2H2I1M1N2O1R2T1 Jul 06 '14 at 15:55
`cSplit()` was released in 1.4.0 (Oct 2014) http://www.r-bloggers.com/splitstackshape-v1-4-0-for-r/ – smci Feb 27 '15 at 12:39

score 2 · Answer 3 · answered Jul 06 '14 at 12:22

You got this far:

mdf <- melt(df, id.vars="Column1", measure.vars=c("Column2", "Column3"))
values <- strsplit(mdf$value, ",")

Now all you need to do is to create an index of which rows of mdf to use:

n <- vapply(values, length, integer(1))
index <- rep.int(seq_along(n), n)

and then combine that with values:

cbind(mdf[index,], unlist(values, use.names = FALSE))
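
Put together, the steps above can be sketched in base R alone. This is an illustration under a few assumptions rather than the answerer's exact code: the long table is built by hand instead of with `melt()`, `lengths()` (R 3.2.0+) stands in for the `vapply()` call, and the split values replace the old `value` column instead of being appended next to it.

```r
# Example data from the question, with text columns kept as character
df <- data.frame(
  Column1 = c("id1", "id2", "id3"),
  Column2 = c("text1,text2,text3", "text4", "text5,text6"),
  Column3 = c("text7", "text8,text9,text10,text11", "text12,text13"),
  stringsAsFactors = FALSE
)

# Base-R stand-in for melt(): stack the measure columns on top of each other
measure <- c("Column2", "Column3")
mdf <- data.frame(
  Column1  = rep(df$Column1, times = length(measure)),
  variable = rep(measure, each = nrow(df)),
  value    = unlist(df[measure], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split each value, then repeat every row index once per resulting piece
values <- strsplit(mdf$value, ",", fixed = TRUE)
n <- lengths(values)
index <- rep.int(seq_along(n), n)

# Replicated id columns plus the flattened pieces as the new value column
result <- data.frame(
  mdf[index, c("Column1", "variable")],
  value = unlist(values, use.names = FALSE),
  stringsAsFactors = FALSE
)
rownames(result) <- NULL  # drop the "1.1", "1.2", ... names left by indexing
```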

score 0 · Answer 4 · edited Jul 17 '14 at 11:51

About the Warning: it appears because you are using factor variables for the melting.

In your example you can avoid the warning by adding stringsAsFactors=FALSE at the end of the df declaration:

df <- data.frame(Column1=c("id1", "id2", "id3"), Column2=c("text1,text2,text3", "text4", "text5,text6"), Column3=c("text7", "text8,text9,text10,text11", "text12,text13"), stringsAsFactors=FALSE)
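
The difference is easy to check on a small example. The sketch below uses made-up object names (`df_factor`, `df_char`) and sets `stringsAsFactors` explicitly in both cases, since the default changed to `FALSE` in R 4.0.0. It also shows why the other answers wrap the column in `as.character()`: `strsplit()` only accepts character input.

```r
# The same column stored as a factor and as a character vector
df_factor <- data.frame(Column2 = c("text1,text2,text3", "text4"),
                        stringsAsFactors = TRUE)
df_char   <- data.frame(Column2 = c("text1,text2,text3", "text4"),
                        stringsAsFactors = FALSE)

is_factor    <- is.factor(df_factor$Column2)
is_character <- is.character(df_char$Column2)

# strsplit() raises an error on a factor, so convert it first
attempt <- try(strsplit(df_factor$Column2, ","), silent = TRUE)
factor_fails <- inherits(attempt, "try-error")
pieces <- strsplit(as.character(df_factor$Column2), ",", fixed = TRUE)
```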

edited Jul 17 '14 at 11:51 by sunbabaphu · answered Jul 17 '14 at 11:41 by franzz2000
