4

I often have tables where a single cell may contain multiple values (divided by some character separator), and I need to split such records, for example:

dt1 <- fread("V1 V2 V3
             x b;c;d 1
             y d;ef  2
             z d;ef  3")

should give something like this:

#    V1 V2 V3
# 1:  x  b  1
# 2:  x  c  1
# 3:  x  d  1
# 4:  y  d  2
# 5:  y ef  2
# 6:  z  d  3
# 7:  z ef  3

So far I made the following function:

# I omit all error-checking code here and assume that
# dtInput   is a valid data.table and
# col2split is a name of existing column
splitcol2rows <- function(dtInput, col2split, sep){
  ori.names <- names(dtInput); # save original order of columns
  ori.keys  <-   key(dtInput); # save original keys

  # create new table with 2 columns:
  # one is original "un-splitted" column (will be later used as a key)
  # and second one is result of strsplit:
  dt.split <- dtInput[, 
                    .(tmp.add.col=rep(unlist(strsplit(get(col2split),sep,T)), .N)),
                    by=col2split]
  dt.split <- unique(dt.split, by=NULL);

  # now use that column as a key:
  setkeyv(dt.split, col2split)
  setkeyv(dtInput, col2split)
  dtInput <- dt.split[dtInput, allow.cartesian=TRUE];

  # leave only 'splitted' column
  dtInput[, c(col2split):=NULL]; 
  setnames(dtInput, 'tmp.add.col', col2split); 

  # restore original columns order and keys
  setcolorder(dtInput, ori.names);
  setkeyv(dtInput, ori.keys);

  return(dtInput);
}

it works fine (check the example output as splitcol2rows(dt1, 'V2', ';')[]), but I'm sure this solution is far from optimal and would be grateful for any advices. For example, I looked through the solution proposed by Matt in the answer to the question "Applying a function to each row of a data.table" and I like that it manages without creating intermediate table (my dt.split), but in my case I need to keep all other columns and don't see how to do that otherwise.


UPD. First, staring from the solution proposed by @RichardScriven, I came to re-writing my function so it became much shorter and easier to read:

splitcol2rows_mget <- function(dtInput, col2split, sep){
  dtInput <- dtInput[, .(tmp.add.col = unlist(strsplit(get(col2split),sep,T))), by=names(dtInput)]

  dtInput[, c(col2split):=NULL];
  setnames(dtInput, 'tmp.add.col', col2split); 
  return(dtInput);
}

It still has some ugly pieces, like intermediate 'tmp.add.col' column which might cause conflict if such columns already existed in the original table. In addition, this shorter solution turned out to work slower than my first code. And both of them are slower than cSplit() from splitstackshape package:

require('microbenchmark')
require('splitstackshape')

splitMy1 <- function(input){return(splitcol2rows(input, col2split = 'V2', sep = ';'))}
splitMy2 <- function(input){return(splitcol2rows_mget(input, col2split = 'V2', sep = ';'))}
splitSH  <- function(input){return(cSplit(input, splitCols = 'V2', sep = ';', direction = 'long'))}

# Smaller table, 100 repeats:
set.seed(1)
num.rows <- 1e4;
dt1 <- data.table(V1=seq_len(num.rows),
                  V2=replicate(num.rows,paste0(sample(letters, runif(1,1,6), T), collapse = ";")),
                  V3=rnorm(num.rows))
print(microbenchmark(splitMy1(dt1), splitMy2(dt1), splitSH(dt1), times=100L))
#Unit: milliseconds
#          expr      min       lq     mean   median       uq       max neval
# splitMy1(dt1) 56.34475 58.53842 68.11128 62.51419 79.79727  98.96797   100
# splitMy2(dt1) 61.84215 64.59619 76.41503 69.02970 88.49229 132.43679   100
#  splitSH(dt1) 31.29671 33.14389 38.28108 34.91696 39.31291  83.58625   100    

# Bigger table, 1 repeat:
set.seed(1)
num.rows <- 5e5;
dt1 <- data.table(V1=seq_len(num.rows),
                  V2=replicate(num.rows,paste0(sample(letters, runif(1,1,6), T), collapse = ";")),
                  V3=rnorm(num.rows))
print(microbenchmark(splitMy1(dt1), splitMy2(dt1), splitSH(dt1), times=1L))

#Unit: seconds
#          expr      min       lq     mean   median       uq      max neval
# splitMy1(dt1) 2.955825 2.955825 2.955825 2.955825 2.955825 2.955825     1
# splitMy2(dt1) 3.693612 3.693612 3.693612 3.693612 3.693612 3.693612     1
#  splitSH(dt1) 1.990201 1.990201 1.990201 1.990201 1.990201 1.990201     1
Community
  • 1
  • 1
Vasily A
  • 8,256
  • 10
  • 42
  • 76
  • 3
    The answer to [a question I asked](http://stackoverflow.com/questions/33288695/how-to-use-tidyrseparate-when-the-number-of-needed-variables-is-unknown) a while back might give you the answer you're looking for. – tblznbits Jan 11 '16 at 01:49
  • @brittenb, thanks a lot, `cSplit()` does the job indeed! maybe I'll just do a little benchmarking to check if their code is as fast for big tables ) – Vasily A Jan 11 '16 at 02:25
  • Well you could get all columns with `dt1[, c(.(V2 = scan(text = V2, sep = ";", what = "")), mget(names(dt1)[-(1:2)])), by = V1]` but at that point I'd just go with `cSplit()` ;-) – Rich Scriven Jan 11 '16 at 02:41
  • @RichardScriven, thanks for the input! but unfortunately this code is extremely slow for big tables. I will add more details in a while. – Vasily A Jan 11 '16 at 03:26
  • @brittenb, maybe you post your comment about `cSplit()` as an answer? I will be glad to mark it as a solution. I will also add my benchmarking results. – Vasily A Jan 11 '16 at 03:49
  • I've updated my question with another solution and benchmarking, feel free to comment if you have something to add. – Vasily A Jan 11 '16 at 05:01

1 Answers1

6

There's a function in the package splitstackshape called cSplit which is perfectly suited for this task. Simply pass ";" as the separator and "long" as the direction to get what we need.

> library(splitstackshape)
> dat <- data.frame(V1 = c("x", "y", "z"), V2 = c("b;c;d", "d;ef", "d;ef"), V3 = 1:3, stringsAsFactors = FALSE)
> cSplit(dat, "V2", sep = ";", direction = "long")
#   V1 V2 V3
# 1:  x  b  1
# 2:  x  c  1
# 3:  x  d  1
# 4:  y  d  2
# 5:  y ef  2
# 6:  z  d  3
# 7:  z ef  3
tblznbits
  • 6,602
  • 6
  • 36
  • 66