Warning: as it turns out, my original version has some scoping issues and also doesn't achieve the goal it is supposed to achieve.
I have a data set of 100,000 (one hundred thousand) records that I would like to split into multiple rows. Every record has a field containing a string with the names of 8 items separated by a semicolon (;). The end result is to have 8 rows for every 1 row of original data.
I have written the following function to help me achieve this, but it doesn't seem to be very efficient, which in turn means it takes impossibly long to execute (I have let it run for at least 30 minutes and it still wasn't done). So I'm looking for tips to improve the run time in any way whatsoever.
A little bit of context:
row[1]
is the semicolon-separated string of items.
row[5]
is the index of the collection of items; it has to be kept with each separate item to be able to relate them later.
library(stringr)  # provides str_split

toSingleItems <- function(data, sep = ';') {
  returnVal <- vector("list", nrow(data) * 8)
  i <- 1
  apply(data, 1, FUN = function(row) {
    # str_split returns a list of character vectors; take the first element
    splitDeck <- str_split(row[1], sep)[[1]]
    lapply(splitDeck, FUN = function(item) {
      # `<<-` is needed here: a plain `<-` would only create local copies
      returnVal[[i]] <<- c(row[5], item)
      i <<- i + 1
    })
  })
  return(returnVal)
}
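One way to see the scoping issue mentioned at the top: inside a function passed to apply/lapply, `<-` creates a new local binding instead of updating the variable in the enclosing function, so the result list never receives any values. A minimal standalone demonstration:

```r
# `<-` inside a closure creates a local variable; `<<-` writes to the
# enclosing environment instead.
counter <- 0
invisible(lapply(1:3, function(x) counter <- counter + 1))
print(counter)  # still 0: each call only modified its own local copy
invisible(lapply(1:3, function(x) counter <<- counter + 1))
print(counter)  # 3: `<<-` reached the enclosing environment
```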
Any tips are welcome, thanks in advance!
Sneaky edit: the obvious workaround is of course to shrink the data set. I have done this (down to 10,000 records), but even then the performance is still pretty damn bad.
The data could look as follows:
"a;b;c;d;w;x;y;z"
"e;f;g;h;i;j;k;l"
The output in this scenario would look like this:
1, "a"
1, "b"
1, "c"
1, "d"
1, "w"
1, "x"
1, "y"
1, "z"
2, "e"
2, "f"
2, "g"
2, "h"
2, "i"
2, "j"
2, "k"
2, "l"
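For reference, the same reshaping can be done without any per-row loop. This is only a sketch on toy data mirroring the example above (I'm assuming the string lives in a column I've called `items` and the collection index in a column I've called `idx`), using base R's vectorised strsplit:

```r
# Toy data mirroring the example: one semicolon-separated string per row,
# plus the collection index that must travel with each item.
df <- data.frame(items = c("a;b;c;d;w;x;y;z", "e;f;g;h;i;j;k;l"),
                 idx   = 1:2,
                 stringsAsFactors = FALSE)

# strsplit handles the whole column in one vectorised call.
parts <- strsplit(df$items, ";", fixed = TRUE)

# Repeat each index once per item and flatten the split items into one column.
result <- data.frame(idx  = rep(df$idx, lengths(parts)),
                     item = unlist(parts),
                     stringsAsFactors = FALSE)
```

This produces one (index, item) row per item, matching the expected output shown above, and avoids both the scoping problem and the per-row overhead of apply.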