As I mentioned in my comment, if your data are balanced (that is, you expect a nice rectangular dataset after splitting), you should look at my concat.split.DT function.
Here are some tests.
Sven's data, but with 20K rows instead of 2
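The original two-row data frame isn't repeated here; judging from the output further down, it was presumably along these lines:
## Assumed reconstruction of Sven's two-row example (inferred from the output below)
dat <- data.frame(a = c("a/b/c/d", "e/f/g/h"), stringsAsFactors = FALSE)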
dat <- do.call(rbind, replicate(1e4, dat, simplify=FALSE))
dim(dat)
# [1] 20000     1
The "stringr" functions are likely to be a bit slow:
library(stringr)
system.time(do.call(rbind, str_split(dat$a, "/")))
#    user  system elapsed 
#   3.194   0.000   3.211
But how do the other solutions fare?
fun1 <- function() concat.split.multiple(dat, "a", "/")
fun2 <- function() do.call(rbind, strsplit(dat$a, "/", fixed=TRUE))
## ^^ fixed = TRUE will make a big difference because "/" is then
##    matched as a literal string rather than as a regular expression
fun3 <- function() concat.split.DT(dat, "a", "/")
library(microbenchmark)
microbenchmark(fun1(), fun2(), fun3(), times = 10)
# Unit: milliseconds
#    expr       min        lq    median        uq       max neval
#  fun1() 530.46597 534.13486 535.19139 538.91488 553.61919    10
#  fun2()  30.22265  31.07287  31.81474  32.93936  40.28859    10
#  fun3()  22.57517  22.94169  23.10297  23.30907  31.97640    10
So, that's about half a second for the regular concat.split.multiple (which just uses read.table under the hood), and much better results for strsplit and for concat.split.DT (the latter of which uses fread from "data.table" under the hood).
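To see where that speed comes from, here is a minimal sketch of the general fread trick (an illustration under assumptions, not the actual concat.split.DT internals): collapse the column into one newline-separated string and let fread parse it as if it were a "/"-delimited file.
library(data.table)
## Sketch only, not the real concat.split.DT code: treat the whole
## column as literal "/"-delimited input via fread's "text" argument
split_cols <- fread(text = paste(dat$a, collapse = "\n"),
                    sep = "/", header = FALSE)
head(split_cols, 2)
#    V1 V2 V3 V4
# 1:  a  b  c  d
# 2:  e  f  g  h
A single C-level parse of one big string avoids much of the per-element overhead of splitting and re-binding at the R level, which is essentially why the fread-based approach wins above.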
Let's scale it up even more, to 1 million rows now (dropping the much slower concat.split.multiple from the comparison)...
dat <- do.call(rbind, replicate(50, dat, simplify=FALSE))
dim(dat)
# [1] 1000000       1
microbenchmark(fun2(), fun3(), times = 5)
# Unit: seconds
#    expr      min       lq    median        uq       max neval
#  fun2() 6.257892 6.522199 13.728283 13.934860 14.277432     5
#  fun3() 1.671739 1.830485  2.203076  2.470872  2.572917     5
The advantage of the concat.split.DT approach is the convenience of splitting multiple columns with a simple syntax:
dat2 <- do.call(cbind, replicate(5, dat, simplify = FALSE))
dim(dat2)
# [1] 1000000       5
names(dat2) <- make.unique(names(dat2))
head(dat2)
#         a     a.1     a.2     a.3     a.4
# 1 a/b/c/d a/b/c/d a/b/c/d a/b/c/d a/b/c/d
# 2 e/f/g/h e/f/g/h e/f/g/h e/f/g/h e/f/g/h
# 3 a/b/c/d a/b/c/d a/b/c/d a/b/c/d a/b/c/d
# 4 e/f/g/h e/f/g/h e/f/g/h e/f/g/h e/f/g/h
# 5 a/b/c/d a/b/c/d a/b/c/d a/b/c/d a/b/c/d
# 6 e/f/g/h e/f/g/h e/f/g/h e/f/g/h e/f/g/h
Now, let's split all of them at once:
system.time(out <- concat.split.DT(dat2, names(dat2), "/"))
#    user  system elapsed 
#   6.260   0.040   6.532
out
#          a_1 a_2 a_3 a_4 a.1_1 a.1_2 a.1_3 a.1_4 a.2_1 a.2_2 a.2_3 a.2_4 a.3_1
#       1:   a   b   c   d     a     b     c     d     a     b     c     d     a
#       2:   e   f   g   h     e     f     g     h     e     f     g     h     e
#       3:   a   b   c   d     a     b     c     d     a     b     c     d     a
#       4:   e   f   g   h     e     f     g     h     e     f     g     h     e
#       5:   a   b   c   d     a     b     c     d     a     b     c     d     a
#      ---
#  999996:   e   f   g   h     e     f     g     h     e     f     g     h     e
#  999997:   a   b   c   d     a     b     c     d     a     b     c     d     a
#  999998:   e   f   g   h     e     f     g     h     e     f     g     h     e
#  999999:   a   b   c   d     a     b     c     d     a     b     c     d     a
# 1000000:   e   f   g   h     e     f     g     h     e     f     g     h     e
#          a.3_2 a.3_3 a.3_4 a.4_1 a.4_2 a.4_3 a.4_4
#       1:     b     c     d     a     b     c     d
#       2:     f     g     h     e     f     g     h
#       3:     b     c     d     a     b     c     d
#       4:     f     g     h     e     f     g     h
#       5:     b     c     d     a     b     c     d
#      ---
#  999996:     f     g     h     e     f     g     h
#  999997:     b     c     d     a     b     c     d
#  999998:     f     g     h     e     f     g     h
#  999999:     b     c     d     a     b     c     d
# 1000000:     f     g     h     e     f     g     h