Splitting strings into multiple rows in R

Question

This has been asked in various places for SQL, but can't find a simple way to do it in R

I have a dataset like this

                                                             Ids    v1  v2  v3  v4  v5
548|14721,678|17604,716|18316,732|18505,745|18626,752|18716 9186    639 9045    316 28396
548|14721,678|17603,716|18316,732|18507,745|18626,752|18716 9041    598 8897    283 28054
548|14722,678|17603,716|18316,732|18507,745|18626,752|18716 8799    588 8669    246 27433
548|14721,678|17603,716|18316,732|18505,745|18626,752|18715 8914    614 8765    273 27347
716|18316,745|18626                                         4113    497 4050    270 27267
548|14722,678|17604,716|18316,732|18507,745|18626,752|18716 8829    589 8713    254 25270

I want to split based on the first column on comma and create a row for each entry, repeating the entries in the rest of the rows.

The function I have written is

split.data <- function(data, split = ",") {
  y <- NULL
  for( i in 1:nrow(data)){
    y <- rbind(y,cbind(data.frame(unlist(strsplit(as.character(data[i,1]), split = split))),data[i,-1]))
  }
  names(y) <- names(data)
  y <- sapply(y,as.character)
  return(as.data.frame(y))
}

This works but its extremely slow. Is there a way to vectorize this and make it faster. The file I have has over 5000 such rows and it takes a while to run.

Related articles I have found for SQL: Split values over multiple rows Turning a Comma Separated string into individual rows

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2013-10-21T02:54:16.253

Updated answer (21 Oct 2013)

These steps can be combined if you use concat.split.multiple from my "splitstackshape" package. This function makes use of count.fields automatically, so it shouldn't suffer from the problem mentioned in the comments:

library(splitstackshape)
out <- concat.split.multiple(mydf, "Ids", seps=",", "long")
head(out)
#     v1  v2   v3  v4    v5 time       Ids
# 1 9186 639 9045 316 28396    1 548|14721
# 2 9041 598 8897 283 28054    1 548|14721
# 3 8799 588 8669 246 27433    1 548|14722
# 4 8914 614 8765 273 27347    1 548|14721
# 5 4113 497 4050 270 27267    1 716|18316
# 6 8829 589 8713 254 25270    1 548|14722
tail(out)
#      v1  v2   v3  v4    v5 time       Ids
# 31 9186 639 9045 316 28396    6 752|18716
# 32 9041 598 8897 283 28054    6 752|18716
# 33 8799 588 8669 246 27433    6 752|18716
# 34 8914 614 8765 273 27347    6 752|18715
# 35 4113 497 4050 270 27267    6      <NA>
# 36 8829 589 8713 254 25270    6 752|18716

Original Answer (27 Feb 2013)

You need to do a combination of splitting the "Ids" strings followed by "reshaping" your data (if I understand your goal correctly).

Instead of splitting the way you did, I've gone ahead and taken advantage of read.csv and the argument fill = TRUE. Note that if the values in your "Ids" column are currently factors, you'll need to use text = as.character(mydf$Ids) instead.

Here's how I'd proceed:

mydf2 <- cbind(read.csv(text = mydf$Ids, fill = TRUE, header = FALSE), mydf[-1])
mydf2
#          V1        V2        V3        V4        V5        V6   v1  v2   v3  v4    v5
# 1 548|14721 678|17604 716|18316 732|18505 745|18626 752|18716 9186 639 9045 316 28396
# 2 548|14721 678|17603 716|18316 732|18507 745|18626 752|18716 9041 598 8897 283 28054
# 3 548|14722 678|17603 716|18316 732|18507 745|18626 752|18716 8799 588 8669 246 27433
# 4 548|14721 678|17603 716|18316 732|18505 745|18626 752|18715 8914 614 8765 273 27347
# 5 716|18316 745|18626                                         4113 497 4050 270 27267
# 6 548|14722 678|17604 716|18316 732|18507 745|18626 752|18716 8829 589 8713 254 25270

That data is currently in a "wide" format. Let's make it into a "long" format. The reshape function needs a little bit of information on how to proceed. In particular, it needs to know:

Which columns are indicative of "id" variables. Counter-intuitively to the names from your dataset, these are not the values from your original "Ids" variable, but the other variables that were present. ("v1" - "v5" (lower-case "v") at positions 7 through 11 in the "mydf2" dataset). Obviously, for your actual data, you need to specify the actual columns that are to be treated as idvars.
Which columns "vary" and need to be "stacked" in the "long" format. In this case, it's the new variables that were created when we used read.csv, and by manually checking their indexes, we can see that they are in positions 1 through 6. Obviously, you would need to specify the column numbers from your actual dataset.

For the example dataset you provided, we would use reshape as follows:

mydf3 <- reshape(mydf2, direction = "long", idvar=7:ncol(mydf2), 
                 varying=1:6, sep = "")
rownames(mydf3) <- NULL

Here's the head and tail of the resulting data.frame:

> head(mydf3)
    v1  v2   v3  v4    v5 time         V
1 9186 639 9045 316 28396    1 548|14721
2 9041 598 8897 283 28054    1 548|14721
3 8799 588 8669 246 27433    1 548|14722
4 8914 614 8765 273 27347    1 548|14721
5 4113 497 4050 270 27267    1 716|18316
6 8829 589 8713 254 25270    1 548|14722
> tail(mydf3)
     v1  v2   v3  v4    v5 time         V
31 9186 639 9045 316 28396    6 752|18716
32 9041 598 8897 283 28054    6 752|18716
33 8799 588 8669 246 27433    6 752|18716
34 8914 614 8765 273 27347    6 752|18715
35 4113 497 4050 270 27267    6          
36 8829 589 8713 254 25270    6 752|18716

This is all assuming we are starting with an object named "mydf" that looks like this:

mydf <- structure(list(Ids = c("548|14721,678|17604,716|18316,732|18505,745|18626,752|18716", 
  "548|14721,678|17603,716|18316,732|18507,745|18626,752|18716", 
  "548|14722,678|17603,716|18316,732|18507,745|18626,752|18716", 
  "548|14721,678|17603,716|18316,732|18505,745|18626,752|18715", 
  "716|18316,745|18626", "548|14722,678|17604,716|18316,732|18507,745|18626,752|18716"
  ), v1 = c(9186L, 9041L, 8799L, 8914L, 4113L, 8829L), v2 = c(639L, 
  598L, 588L, 614L, 497L, 589L), v3 = c(9045L, 8897L, 8669L, 8765L, 
  4050L, 8713L), v4 = c(316L, 283L, 246L, 273L, 270L, 254L), v5 = c(28396L, 
  28054L, 27433L, 27347L, 27267L, 25270L)), .Names = c("Ids", "v1", 
  "v2", "v3", "v4", "v5"), class = "data.frame", row.names = c(NA, 
  -6L))

+1! good use of fill=TRUE to deal with the problem at source. — agstudy, Feb 26 '13 at 19:26
Great!. Your solution works. I will need to make some changes as its not certain that there will always be 6 columns after splitting. I have updated the request to the next step. Let me know if that makes any difference. — Rohit Das, Feb 26 '13 at 19:59
I found an issue with using the read.csv approach. It worked on the sample data I provided, but when applied to the full data there were rows with more than 6 ids in the string and instead of adding another column, it adds another row and thus the cbind fails. — Rohit Das, Feb 26 '13 at 20:20
@RohitDas My guess is that `read.csv` assumes that there are 5 fields from the first few lines of input, and then fills in anything that is less than 5. If it's greater than 5, it assumes that there's a new line. If you know what the maximum number of ID's is(from something like `max(sapply(lapply(df$Ids,strsplit,split=','),length))`), then you can add `colClasses=rep.int(NA_character,6)` (or whatever the max number of IDs is) as an argument to `read.csv`. — Blue Magister, Feb 26 '13 at 20:24
@RohitDas, just modify the reshape command to correctly identify the column numbers for the columns which are not varying (lower case v1-v5) as the idvars, and the column numbers for the columns which need to be stacked as the varying argument. — A5C1D2H2I1M1N2O1R2T1, Feb 26 '13 at 21:28
just an update that the function `concat.split.multiple` is now updated to `cSplit`. — Hardik Gupta, Sep 17 '18 at 09:19

Splitting strings into multiple rows in R

1 Answers1

Updated answer (21 Oct 2013)

Original Answer (27 Feb 2013)