I would like to ask about the options for splitting strings in R. As far as I know, there are three: strsplit() in base R, str_split() in the stringr package, and separate() in the tidyr package. I wonder how they differ from a programming point of view. Since I am not trained as a programmer, this may not be clear, so let me give an example. In the past, I learned the difference between rbind() and rbindlist() in the data.table package (Why is rbindlist "better" than rbind?). That was a great learning experience for me. I would like to know which string-splitting option is better than the others, just as that post explains for rbind() and rbindlist(). I hope this example clarifies what I am trying to ask. Thank you for your time.
-
I recommend you try each of them. :) Use the microbenchmark package to test which works fastest. – mgriebe Aug 14 '14 at 06:48
-
This site is not dedicated to seeking opinions, but answering programming questions. Rephrase your question to something that can be objectively answered. – Roman Luštrik Aug 14 '14 at 07:20
-
Roman, thank you for your comment. I had no intention of seeking opinions. Rather, I wanted to know how they differ from a programming point of view. I guess my intention did not come through, given that English is my second language. I will rephrase my question. Thank you. – jazzurro Aug 14 '14 at 07:39
-
If you are really trying to look around, the QDAP package has colSplit() and colSplit2df(). – lawyeR Aug 14 '14 at 12:24
1 Answer
Unlike strsplit() and str_split(), which take a character vector and return a list, separate() takes a data frame and places the pieces in separate columns of that data frame. str_split() also lets you specify the maximum number of pieces to return for each split (its n argument).
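To make the difference in return shapes concrete, here is a minimal sketch. The input vector x and the data frame df are made up for illustration and are not taken from the question.

```r
library(stringr)
library(tidyr)

# Hypothetical input: the last string contains two dots
x <- c("a.o", "b.p", "c.q.r")

# Base R: returns a list with one character vector per input string
base_out <- strsplit(x, "[.]")

# stringr: also returns a list, but n caps the number of pieces per string,
# so the remainder of the last string stays together as "q.r"
capped <- str_split(x, "[.]", n = 2)

# tidyr: works on a data frame and replaces column d with columns e and f
df <- data.frame(d = c("a.o", "b.p"), stringsAsFactors = FALSE)
wide <- separate(df, d, into = c("e", "f"), sep = "[.]")
```

So the choice depends partly on what you want back: a list to process further, or new columns in an existing data frame.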
There are many ways to split strings (in certain circumstances, you can use substr() and/or grep()). For large data, consider the answers in this post: Split text string in a data.table columns
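As a rough sketch of two such alternatives: substr() works when the pieces sit at fixed positions, and data.table's tstrsplit() is one way to split a column in place. The sample data is made up, and I am assuming tstrsplit() is the kind of approach the linked post has in mind; check the post itself for what it recommends.

```r
library(data.table)

# Hypothetical strings in which every piece is exactly one character wide
x <- c("a.o", "b.p", "c.q")

# Fixed positions: no pattern matching needed, just character offsets
left  <- substr(x, 1, 1)
right <- substr(x, 3, 3)

# data.table: split column d on a literal "." and assign the pieces by reference
dt <- data.table(d = x)
dt[, c("e", "f") := tstrsplit(d, ".", fixed = TRUE)]
```

Using fixed = TRUE treats "." as a literal character rather than a regular expression, which avoids the need for "[.]".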
Here are some benchmark results, and you can create your own:
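library(microbenchmark)
library(stringr)
library(tidyr)
library(data.table)
# Build 100 rows; rep_len() makes the recycling of the shorter vectors explicit
dt <- data.table(a = rep_len(letters[1:20], 100), b = rep_len(letters[15:21], 100), c = 1:100)
dt[, d := paste(a, b, sep = ".")]  # strings such as "a.o"
this <- dt[, d]                    # plain character vector for strsplit() and str_split()
# Time the three approaches on the same 100 strings
microbenchmark(strsplit(this, "[.]"), str_split(this, "[.]"), separate(dt, "d", c("e", "f"), "[.]"))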
# Unit: microseconds
#                                  expr      min       lq     median       uq      max neval
#                 strsplit(this, "[.]")   53.432   56.753    59.4705   62.941  103.846   100
#                str_split(this, "[.]") 4390.459 4878.137  5020.0180 5118.127 6598.367   100
# separate(dt, "d", c("e", "f"), "[.]")  165.126  178.107   189.7290  232.142  299.460