4

I was wondering whether the most efficient way to extract text from a column was using the sub function in a data.table.

For example, we have the following data set:

test <- data.table(a = c("Hello world, this is Tom and I am a guy", "Hello world, this is Jack and I am a guy"))

and I would like to extract the names. One way to extract the names are using the substitution function

test[, Name := sub(".*? this is (.*?) and.*", "\\1", a)]

but I was wondering, is this the most efficient way?

Snowflake
  • 2,869
  • 3
  • 22
  • 44
  • Fyi, better, when asking a performance/efficiency question to provide a scalable example. In this case, the number of distinct values really matters, eg with @akrun's example we get timing near 0 when using `by=` ... `system.time(test1[, Name := sub(".*? this is (.*?) and.*", "\\1", a), by=a])`. An example: https://stackoverflow.com/q/4322219/1191259 – Frank Jul 17 '18 at 15:27

1 Answers1

6

Combining with str_extract and str_remove reduces the time

library(stringr)
test1 <- test[rep(seq_len(.N), 1e6)]
test2 <- copy(test1)
system.time(test1[, Name := sub(".*? this is (.*?) and.*", "\\1", a)])    
#   user  system elapsed 
#  4.590   0.002   4.597 
system.time(test2[, Name :=  str_remove(str_extract(a, "this is \\w+"), 
                  "this is ")])
#   user  system elapsed 
#   2.259   0.076   2.339 

identical(test1$Name, test2$Name)
#[1] TRUE

library(microbenchmark)
f1 <- function()  sub(".*? this is (.*?) and.*", "\\1", test1$a)
f2 <- function() str_remove(str_extract(test1$a, "this is \\w+"), "this is ")

microbenchmark(f1(), f2(), unit = 'relative', times = 10L)  
#Unit: relative
#expr     min      lq     mean   median       uq      max neval
# f1() 2.12879 2.14592 2.145551 2.173798 2.188693 2.121836    10
# f2() 1.00000 1.00000 1.000000 1.000000 1.000000 1.000000    10
akrun
  • 874,273
  • 37
  • 540
  • 662