0

Warning: as it turns out, my original version has some scoping issues and also doesn't achieve the goal it is supposed to.


I have a data set of 100000 (ten thousand) records that I would like to split up in to multiple rows. Every record has a field that contains a string with names of 8 items separated by a semicolon (;). The end result is to have 8 rows for every 1 row of original data.

I have written the following function to help me achieve this but it doesn't seem to be very efficient which in turn means that it takes impossibly long to be execute (I have let it run for atleast 30 minutes and it still wasn't done). So I'm looking for tips to improve the run time in any way whatsoever.

A little bit of context:

row[1] is the semicolon separated string of items.

row[5] is the index of the collection of items that has to be kept with the separate item to be able to relate them later.

toSingleItems <- function(data, sep = ';') {
  returnVal <- vector("list", nrows(data) * 8)
  i <- 1

  apply(data, 1, FUN = function(row) {
    splitDeck = str_split(row[1], sep)
    lapply(splitDeck, FUN=function(item){
      returnVal[[i]] <- c(row[5], item)
      i <- i + 1
    })
  })
  return(returnVal)
}

Any tips are welcome, thanks in advance!

Sneaky edit: the obvious solution is of course to reduce the data set in any way. I have done this (to 10000) but even then the performance is still pretty damn bad.

The data could look as follows:

"a;b;c;d;w;x;y;z"
"e;f;g;h;i;j;k;l"

The output in this scenario would look like this:

1, "a"
1, "b"
1, "c"
1, "d"
1, "w"
1, "x"
1, "y"
1, "z"
2, "e"
2, "f"
2, "g"
2, "h"
2, "i"
2, "j"
2, "k"
2, "l"
yarwest
  • 873
  • 8
  • 19
  • It would be easier to help if you could include some sample of data with expected output. From the description it seems you need something like this? https://stackoverflow.com/questions/15347282/split-delimited-strings-in-a-column-and-insert-as-new-rows – Ronak Shah Jun 09 '19 at 13:34
  • @RonakShah I added example in and out put, I'll try experimenting with info from the other question. – yarwest Jun 09 '19 at 15:27

2 Answers2

1

Here's a simpler example that might represent your data?

df = data.frame(x = c("a;b;c;d", "e;f;g;h"), y = 1:2, stringsAsFactors=FALSE)

To split x, I did

elts = strsplit(df$x, ";")

We need to know the length of each element, which is lengths(elts). I then put things back together

data.frame(x = unlist(elts), y = rep(y, lengths(elts))

Maybe a generalization would be to do the split and calculate lengths as above, replicate the whole data set

idx = rep(seq_len(nrow(df)), lengths(elts))
df = df[idx,]

and update the relevant column

df$x = unlist(elts)

The package tidyr offers a very convenient function

df %>% tidyr::separate_rows(x, sep = ";")
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
1

See whether the below code helps.

library(tidyverse)
df = data.frame(id= c(1,2), value=c("a;b;c;d;w;x;y;z", "e;f;g;h;i;j;k;l"))

df = df %>% 
  mutate(value = str_split(value,pattern=";", n=8)) %>% # transform the string to list
  unnest # list to individual rows
Theo
  • 575
  • 3
  • 8