
I am new to R and I have a big dataset with 17 columns and over 1 million rows. I want to split one of the columns into 4 on the divider '/'. R is taking forever to run the commands below. Is there a better way of doing this? I have included as much code as possible and would appreciate any help.

sample2 <- read.csv("week1.csv", header = TRUE)
## Drop columns 3 and 7; keep only rows where SPORTS_ID == '1'
summ1 <- subset(sample2, select = -c(3, 7), subset = (SPORTS_ID == '1'))
summ1 <- summ1[, -c(1)]                          # drop the first column
library(splitstackshape)
summ2 <- concat.split.multiple(summ1, 2, "/")    # split column 2 on "/"
summ2 <- summ2[, -c(1, 15)]
summ3 <- concat.split.multiple(summ2, 14, "v")   # split column 14 on "v"
write.csv(summ3, file = "test.csv")

4 Answers


You can use strsplit:

dat <- data.frame(a = c("a/b/c/d",
                        "e/f/g/h"),
                  stringsAsFactors = FALSE)
#         a
# 1 a/b/c/d
# 2 e/f/g/h

cbind(dat, do.call(rbind, strsplit(dat$a, "/")))
#         a 1 2 3 4
# 1 a/b/c/d a b c d
# 2 e/f/g/h e f g h
– Sven Hohenstein
  • +1. I'd add a `fixed = TRUE` to your `strsplit` for efficiency. This fares quite decently in [the benchmarks I've posted](http://stackoverflow.com/a/22255113/1270695). – A5C1D2H2I1M1N2O1R2T1 Mar 07 '14 at 16:19
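
In other words, a one-line variation on the answer above (same `dat`), treating the separator as a literal string rather than a regex:

strsplit(dat$a, "/", fixed = TRUE)
# [[1]]
# [1] "a" "b" "c" "d"
#
# [[2]]
# [1] "e" "f" "g" "h"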

As I mentioned in my comment, if your data are balanced (that is, if you expect a nice rectangular dataset after splitting), you should look at my concat.split.DT function.

Here are some tests.

Sven's data, but with 20K rows instead of 2:

dat <- do.call(rbind, replicate(1e4, dat, simplify=FALSE))
dim(dat)
# [1] 20000     1

The "stringr" functions are likely to be a bit slow:

library(stringr)
system.time(do.call(rbind, str_split(dat$a,  "/")))
#    user  system elapsed 
#   3.194   0.000   3.211 

But how do the other solutions fare?

fun1 <- function() concat.split.multiple(dat, "a", "/")
fun2 <- function() do.call(rbind, strsplit(dat$a, "/", fixed=TRUE))
## ^^ fixed = TRUE will make a big difference
fun3 <- function() concat.split.DT(dat, "a", "/")

library(microbenchmark)
microbenchmark(fun1(), fun2(), fun3(), times = 10)
# Unit: milliseconds
#    expr       min        lq    median        uq       max neval
#  fun1() 530.46597 534.13486 535.19139 538.91488 553.61919    10
#  fun2()  30.22265  31.07287  31.81474  32.93936  40.28859    10
#  fun3()  22.57517  22.94169  23.10297  23.30907  31.97640    10

So, that's about half a second for the regular concat.split.multiple (which just uses read.table under the hood), and much better results for strsplit and concat.split.DT (the latter of which uses fread from "data.table" under the hood).
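
For the curious, the core of that fread trick can be sketched in a few lines. This is a simplified illustration of the idea, not the actual concat.split.DT source; splitDT is just a hypothetical helper name, and the text = argument needs a reasonably recent "data.table":

library(data.table)
splitDT <- function(x, sep = "/") {
  ## Join the values into one newline-separated blob and let fread's
  ## fast parser split each line on the separator.
  fread(text = paste(x, collapse = "\n"), sep = sep, header = FALSE)
}
splitDT(c("a/b/c/d", "e/f/g/h"))
#    V1 V2 V3 V4
# 1:  a  b  c  d
# 2:  e  f  g  h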

Let's scale it up even more, to 1 million rows now...

dat <- do.call(rbind, replicate(50, dat, simplify=FALSE))
dim(dat)
# [1] 1000000       1

microbenchmark(fun2(), fun3(), times = 5)
# Unit: seconds
#    expr      min       lq    median        uq       max neval
#  fun2() 6.257892 6.522199 13.728283 13.934860 14.277432     5
#  fun3() 1.671739 1.830485  2.203076  2.470872  2.572917     5

The advantage of the concat.split.DT approach is the convenience of splitting multiple columns with a simple syntax:

dat2 <- do.call(cbind, replicate(5, dat, simplify = FALSE))
dim(dat2)
# [1] 1000000       5
names(dat2) <- make.unique(names(dat2))
head(dat2)
#         a     a.1     a.2     a.3     a.4
# 1 a/b/c/d a/b/c/d a/b/c/d a/b/c/d a/b/c/d
# 2 e/f/g/h e/f/g/h e/f/g/h e/f/g/h e/f/g/h
# 3 a/b/c/d a/b/c/d a/b/c/d a/b/c/d a/b/c/d
# 4 e/f/g/h e/f/g/h e/f/g/h e/f/g/h e/f/g/h
# 5 a/b/c/d a/b/c/d a/b/c/d a/b/c/d a/b/c/d
# 6 e/f/g/h e/f/g/h e/f/g/h e/f/g/h e/f/g/h

Now, let's split all of them at once:

system.time(out <- concat.split.DT(dat2, names(dat2), "/"))
#    user  system elapsed 
#   6.260   0.040   6.532 
out
#          a_1 a_2 a_3 a_4 a.1_1 a.1_2 a.1_3 a.1_4 a.2_1 a.2_2 a.2_3 a.2_4 a.3_1
#       1:   a   b   c   d     a     b     c     d     a     b     c     d     a
#       2:   e   f   g   h     e     f     g     h     e     f     g     h     e
#       3:   a   b   c   d     a     b     c     d     a     b     c     d     a
#       4:   e   f   g   h     e     f     g     h     e     f     g     h     e
#       5:   a   b   c   d     a     b     c     d     a     b     c     d     a
#      ---                                                                      
#  999996:   e   f   g   h     e     f     g     h     e     f     g     h     e
#  999997:   a   b   c   d     a     b     c     d     a     b     c     d     a
#  999998:   e   f   g   h     e     f     g     h     e     f     g     h     e
#  999999:   a   b   c   d     a     b     c     d     a     b     c     d     a
# 1000000:   e   f   g   h     e     f     g     h     e     f     g     h     e
#          a.3_2 a.3_3 a.3_4 a.4_1 a.4_2 a.4_3 a.4_4
#       1:     b     c     d     a     b     c     d
#       2:     f     g     h     e     f     g     h
#       3:     b     c     d     a     b     c     d
#       4:     f     g     h     e     f     g     h
#       5:     b     c     d     a     b     c     d
#      ---                                          
#  999996:     f     g     h     e     f     g     h
#  999997:     b     c     d     a     b     c     d
#  999998:     f     g     h     e     f     g     h
#  999999:     b     c     d     a     b     c     d
# 1000000:     f     g     h     e     f     g     h
– A5C1D2H2I1M1N2O1R2T1

This should get you started. You may need to adjust the regex patterns according to what your data contain. A reproducible example would help; see How to make a great R reproducible example?

library(stringr)
df <- data.frame(x = 1:10, y = rep("first/second", 10), stringsAsFactors = FALSE)
## Remove everything from the slash onward to get the first part,
## and everything up to (and including) the slash to get the second.
df$first  <- str_replace(df$y, "/\\w+", "")
df$second <- str_replace(df$y, "\\w+/", "")
df

> df
    x            y first second
1   1 first/second first second
2   2 first/second first second
3   3 first/second first second
4   4 first/second first second
5   5 first/second first second
6   6 first/second first second
7   7 first/second first second
8   8 first/second first second
9   9 first/second first second
10 10 first/second first second
– r.bot
  • Sorry, but this would be a pretty bad idea for a few reasons. For starters, do you plan to do this if, say, you expected the result of the split to have 20 columns? – A5C1D2H2I1M1N2O1R2T1 Mar 07 '14 at 16:21
  • The original question stated that four new columns were expected. What are your other concerns? – r.bot Mar 07 '14 at 16:23
  • Another concern is that "stringr" is not necessarily going to give you any speed boost. Not to mention that a good answer on SO is not necessarily one that solves an immediate problem as stated in the question, but one that might propose solutions that are applicable to a wider range of situations. – A5C1D2H2I1M1N2O1R2T1 Mar 07 '14 at 16:25
  • Hi guys, thanks for all the responses; I will try these out in a while. A copy of the data column is below. I will be splitting it into 4 new columns: English Soccer/Barclays Premier League/Fixtures 18 August/Arsenal v Sunderland – user3393102 Mar 08 '14 at 16:19
  • I want to break the above down so that, for example, English Soccer goes into one column under the heading League, Barclays Premier League into the next column under Comp, and so on (see the sketch below). – user3393102 Mar 08 '14 at 16:23
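
Applying the strsplit approach from the first answer to the sample row above might look like the following sketch (League and Comp come from the comment; Fixtures and Event are hypothetical placeholders):

x <- "English Soccer/Barclays Premier League/Fixtures 18 August/Arsenal v Sunderland"
out <- as.data.frame(do.call(rbind, strsplit(x, "/", fixed = TRUE)),
                     stringsAsFactors = FALSE)
## The last two names are guesses; replace them with the real headings
names(out) <- c("League", "Comp", "Fixtures", "Event")
out
#           League                    Comp            Fixtures                Event
# 1 English Soccer Barclays Premier League Fixtures 18 August Arsenal v Sunderland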

If you're going to be working with characters and don't mind lists, str_split from the "stringr" package should help:

library(stringr)
x <- 'hello/hi/hey/hola'
str_split(x, "/")
[[1]]
[1] "hello" "hi"    "hey"   "hola"
– maloneypatr
  • I was guessing ([and was able to verify](http://stackoverflow.com/a/22255113/1270695)) that this approach would not be very efficient. Using base R's string manipulation tools would be much more efficient. – A5C1D2H2I1M1N2O1R2T1 Mar 07 '14 at 16:17
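
For comparison, a base R equivalent of the call above (a minimal sketch; `fixed = TRUE` treats the separator as a literal string and skips the regex engine):

strsplit(x, "/", fixed = TRUE)
# [[1]]
# [1] "hello" "hi"    "hey"   "hola"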