0

I have a dataset that looks like this:

C1     C2
A,B,C  1
A      2
D,A    3

I want to transform this to:

 C1     C2
 A      1
 B      1
 C      1
 A      2
 D      3
 A      3

I currently have no idea as to how to proceed or even exactly what to search for. I had been trying to formulate something using data.table but could not achieve anything remotely useful. Please advise on how this can be achieved.

To Mods: Please merge with similar questions if you can find any. I couldn't find any but strongly suspect this to be a fairly common question.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Arani
  • 753
  • 1
  • 9
  • 23
  • Sorry for the bad title, couldn't understand how to describe this specific problem. – Arani Dec 03 '18 at 04:57
  • 1
    @RonakShah, I knew it was covered elsewhere, good of you to find it! – r2evans Dec 03 '18 at 05:39
  • Thanks Ronak. I couldn't understand how to phrase the search, which is why I couldn't find it. – Arani Dec 03 '18 at 06:38
  • 1
    Original Q with benchmark: https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows. Thanks Ronak again for helping me find it. – Arani Dec 03 '18 at 06:41

2 Answers2

1
x <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
 C1     C2
 A,B,C  1
 A      2
 D,A    3")


C1s <- strsplit(x$C1, ",")
C1s
# [[1]]
# [1] "A" "B" "C"
# [[2]]
# [1] "A"
# [[3]]
# [1] "D" "A"
data.frame(C1=unlist(C1s), C2=rep(x$C2, times=lengths(C1s)), stringsAsFactors=FALSE)
#   C1 C2
# 1  A  1
# 2  B  1
# 3  C  1
# 4  A  2
# 5  D  3
# 6  A  3
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks, I prefer the other answer for preciseness, but yours reminded me of the possible complications regarding factors in a column (I usually read my tables as character to avoid that). – Arani Dec 03 '18 at 06:36
  • It's completely your choice, though I don't see how either answer is any more precise than the other: both use `strsplit`, both guard against `factor`, and the only key difference is the use of `lengths` vs `str_count`. – r2evans Dec 03 '18 at 19:57
  • 1
    You are right. I just liked how it was in basically one line of code. There are not much difference (speed will be almost exactly the same anyway). I took the one liner approach and expanded on it to include some other transformations. I learnt that I can use almost anything inside the data.table except for loops (made a function for that). Thanks. – Arani Dec 03 '18 at 20:21
1

You can do it like this:

df = data.frame(C1 = c("A,B,C", "A", "D,A"), C2=c(1,2,3))

library(stringr)
df2 <- data.frame(C1 = unlist(strsplit(as.character(df$C1), ",")), 
                  C2 = rep(df$C2, str_count(df$C1, ',')+1))
Taher A. Ghaleb
  • 5,120
  • 5
  • 31
  • 44