I would like to ask about the options for splitting strings in R. As far as I know, there are three: strsplit() in base R, str_split() in the stringr package, and separate() in the tidyr package. I wonder how they differ from a programming point of view. Since I am not trained as a programmer, this may not be clear, so let me give an example. In the past, I learned the difference between rbind() and rbindlist() in the data.table package (Why is rbindlist "better" than rbind?). That was a great lesson for me. I would like to know which string-splitting option is better than the others, in the same way that post compares rbind() and rbindlist(). I hope this example clarifies what I am trying to ask. Thank you for your time.

jazzurro
  • I recommend you try each of them. :) Use the microbenchmark package to test which works fastest. – mgriebe Aug 14 '14 at 06:48
  • This site is not dedicated to seeking opinions, but answering programming questions. Rephrase your question to something that can be objectively answered. – Roman Luštrik Aug 14 '14 at 07:20
  • Roman, thank you for your comments. I had no intention of seeking opinions. Rather, I wanted to know how they differ from a programming point of view. I guess my intention did not come through, given that English is my second language. I will rephrase my question. Thank you. – jazzurro Aug 14 '14 at 07:39
  • If you are really trying to look around, the qdap package has colSplit() and colsplit2df(). – lawyeR Aug 14 '14 at 12:24

1 Answer

Unlike strsplit() and str_split(), separate() takes a data frame and puts the pieces into separate columns of that data frame. str_split() lets you specify the maximum number of pieces to return for each string, through its n argument.
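
To make the difference in inputs and outputs concrete, here is a minimal sketch. The vector `x`, the data frame `df`, and the result names are made up for illustration; it assumes stringr and tidyr are installed.

```r
library(stringr)
library(tidyr)

x <- c("a.o", "b.p.q")  # illustrative input, not from the benchmark

# Base R: returns a list with one character vector per input string.
base_out <- strsplit(x, ".", fixed = TRUE)

# stringr: also returns a list; n caps the number of pieces per string,
# and the last piece keeps whatever is left over.
capped <- str_split(x, fixed("."), n = 2)

# tidyr: works on a data frame column and writes the pieces to new columns.
df <- data.frame(d = c("a.o", "b.p"), stringsAsFactors = FALSE)
wide <- separate(df, d, into = c("e", "f"), sep = "[.]")
```

With this input, `capped[[2]]` holds `"b"` and `"p.q"` rather than three pieces, and `wide` has the columns `e` and `f` in place of `d`.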

There are many ways to split strings (in certain circumstances, you can use substr() and/or grep()). For large data, consider the answers in this post: Split text string in a data.table columns
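
As a rough illustration of the base R alternatives, the sketch below assumes strings of the form "x.y" with single-character parts. It uses sub() and regmatches() with gregexpr() as the regex-based route; that is my substitution for the grep mentioned above, since grep() itself only reports which elements match. All variable names are made up.

```r
x <- c("a.o", "b.p", "c.q")  # illustrative input

# Fixed positions: substr() picks characters by index, no pattern matching.
first  <- substr(x, 1, 1)
second <- substr(x, 3, 3)

# Pattern-based: sub() keeps only the part before or after the first dot.
before <- sub("[.].*$", "", x)
after  <- sub("^[^.]*[.]", "", x)

# regmatches() + gregexpr() return every run of non-dot characters,
# one character vector per input string.
pieces <- regmatches(x, gregexpr("[^.]+", x))
```

The substr() route only works when the pieces sit at known positions; the regex routes do not depend on position but do more work per string.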

Here are some benchmark results, and you can create your own:

    library(microbenchmark)
    library(stringr)
    library(tidyr)
    library(data.table)

    # 100 rows; rep_len() spells out the recycling of the two shorter vectors
    dt <- data.table(a = rep_len(letters[1:20], 100), b = rep_len(letters[15:21], 100), c = 1:100)
    dt[, d := paste(a, b, sep = ".")]  # strings such as "a.o"
    this <- dt[, d]                    # plain character vector for strsplit() and str_split()

    microbenchmark(strsplit(this, "[.]"), str_split(this, "[.]"), separate(dt, "d", c("e", "f"), "[.]"))
    # Unit: microseconds
    #                                   expr      min       lq    median       uq      max neval
    #                  strsplit(this, "[.]")   53.432   56.753   59.4705   62.941  103.846   100
    #                 str_split(this, "[.]") 4390.459 4878.137 5020.0180 5118.127 6598.367   100
    #  separate(dt, "d", c("e", "f"), "[.]")  165.126  178.107  189.7290  232.142  299.460   

mgriebe