Creating unique rows by breaking a column with comma separated values

Question

I have a dataset that looks like this:

C1     C2
A,B,C  1
A      2
D,A    3

I want to transform this to:

I currently have no idea as to how to proceed or even exactly what to search for. I had been trying to formulate something using data.table but could not achieve anything remotely useful. Please advise on how this can be achieved.

To Mods: Please merge with similar questions if you can find any. I couldn't find any but strongly suspect this to be a fairly common question.

Sorry for the bad title, couldn't understand how to describe this specific problem. — Arani, Dec 03 '18 at 04:57
@RonakShah, I knew it was covered elsewhere, good of you to find it! — r2evans, Dec 03 '18 at 05:39
Thanks Ronak. I couldn't understand how to phrase the search, which is why I couldn't find it. — Arani, Dec 03 '18 at 06:38
Original Q with benchmark: https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows. Thanks Ronak again for helping me find it. — Arani, Dec 03 '18 at 06:41

score 1 · Answer 1 · answered Dec 03 '18 at 05:16

1

x <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
 C1     C2
 A,B,C  1
 A      2
 D,A    3")


C1s <- strsplit(x$C1, ",")
C1s
# [[1]]
# [1] "A" "B" "C"
# [[2]]
# [1] "A"
# [[3]]
# [1] "D" "A"
data.frame(C1=unlist(C1s), C2=rep(x$C2, times=lengths(C1s)), stringsAsFactors=FALSE)
#   C1 C2
# 1  A  1
# 2  B  1
# 3  C  1
# 4  A  2
# 5  D  3
# 6  A  3

answered Dec 03 '18 at 05:16

r2evans

141,215
6
77
149

Thanks, I prefer the other answer for preciseness, but yours reminded me of the possible complications regarding factors in a column (I usually read my tables as character to avoid that). – Arani Dec 03 '18 at 06:36
It's completely your choice, though I don't see how either answer is any more precise than the other: both use `strsplit`, both guard against `factor`, and the only key difference is the use of `lengths` vs `str_count`. – r2evans Dec 03 '18 at 19:57
1

You are right. I just liked how it was in basically one line of code. There are not much difference (speed will be almost exactly the same anyway). I took the one liner approach and expanded on it to include some other transformations. I learnt that I can use almost anything inside the data.table except for loops (made a function for that). Thanks. – Arani Dec 03 '18 at 20:21

score 1 · Accepted Answer · answered Dec 03 '18 at 05:16

1

You can do it like this:

df = data.frame(C1 = c("A,B,C", "A", "D,A"), C2=c(1,2,3))

library(stringr)
df2 <- data.frame(C1 = unlist(strsplit(as.character(df$C1), ",")), 
                  C2 = rep(df$C2, str_count(df$C1, ',')+1))

answered Dec 03 '18 at 05:16

Taher A. Ghaleb

5,120
5
31
44

working like a charm. Thanks. – Arani Dec 03 '18 at 06:35

Creating unique rows by breaking a column with comma separated values

2 Answers2