
I have this code:

matrix<-outer(df$text, df1$regexp, str_count)

where df has more than 1000 texts, each about 1500 characters long, and df1 has 500 negation regexes like

(?<!(no|not|n`t|n’t|neither|never|no one|nobody|none|nor|nothing|nowhere|hardly|barely|scarcely|unlikely|seldom|rarely))[ ][aA][bB][aA][nN][dD][oO][nN]

so my code runs for more than an hour.

How can I accelerate my code?

A reproducible example:

library(stringr)
df<-data.frame(names=c("text1","text2"), text=c("one two three four five","six seven eight nine ten"))
regex<-data.frame(names=c("1","2"), regexp=c("(?<!(no|not))[ ][oO][nN][eE]","(?<!(no|not))[ ][fF][iI][vV][eE]"))
matrix<-outer(df$text, as.character(regex$regexp), str_count)

I've tried running the code in parallel with

library(stringr)
library(parallel)
no_cores <- detectCores() - 1
df<-data.frame(names=c("text1","text2"), text=c("one two three four five","six seven eight nine ten"))
regex<-data.frame(names=c("1","2"), regexp=c("(?<!(no|not))[ ][oO][nN][eE]","(?<!(no|not))[ ][fF][iI][vV][eE]"))
cl <- makeCluster(no_cores)
matrix<-parSapply(cl,regex$regexp, str_count, string=df$text)
stopCluster(cl)

and now the code is about 40% faster on my 4-core PC.

I've changed all the regexes as Wiktor recommended, and the code now runs about 25% faster than the parallelized code with the old regexes:

(?<!n(?:[`’]t|e(?:ither|ver)|o(?:t| one|body|ne|r|thing|where){0,1})|hardly|barely|scarcely|unlikely|seldom|rarely)[ ][aA][bB][aA][nN][dD][oO][nN]
stack user

2 Answers


The regex flavor used in stringr is ICU (so it cannot be tested at regex101.com, which does not support that flavor), and this flavor does not require fully fixed-width lookbehinds. It supports a limiting quantifier inside a lookbehind, as well as regular * and + in some simple cases (though the latter two are more a bug than a feature and might get fixed later).
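For instance, a bounded quantifier inside a lookbehind compiles and runs fine in stringr (a small check; the sample strings are made up):

```r
library(stringr)
# ICU accepts a bounded quantifier inside the lookbehind;
# " one" is counted only when not directly preceded by "no" or "not".
str_count(c("no one", "a one", "not one"), "(?<!not{0,1})[ ]one")
```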

So, your regex is slow because several alternation branches start with the same substrings, which creates excessive backtracking. You need to make sure that no two branches can match at one and the same location.

Use

(?<!n(?:[`’]t|e(?:ither|ver)|o(?:t| one|body|ne|r|thing|where){0,1})|hardly|barely|scarcely|unlikely|seldom|rarely)[ ][aA][bB][aA][nN][dD][oO][nN]
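As a sanity check, the rewritten lookbehind should find exactly the same matches as the original; only the amount of backtracking changes. A quick comparison on a shortened variant of both patterns (the negation-word list is trimmed and the sample texts are made up):

```r
library(stringr)
# Shortened versions of both patterns: same four negation words,
# but the optimized one factors out the shared "n" prefix.
old_rx <- "(?<!(no|not|neither|never))[ ][aA][bB][aA][nN][dD][oO][nN]"
new_rx <- "(?<!n(?:e(?:ither|ver)|ot{0,1}))[ ][aA][bB][aA][nN][dD][oO][nN]"
x <- c("they abandon ship", "never abandon hope", "not abandon", "no abandon")
cbind(old = str_count(x, old_rx), new = str_count(x, new_rx))
```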
Wiktor Stribiżew

Create the data correctly up-front (character rather than factor):

df <- data.frame(names=c("text1","text2"),
                 text=c("one two three four five",
                        "six seven eight nine ten"),
                 stringsAsFactors=FALSE)

regex <- data.frame(names=c("1","2"), 
                    regexp=c("(?<!(no|not))[ ][oO][nN][eE]",
                             "(?<!(no|not))[ ][fF][iI][vV][eE]"),
                    stringsAsFactors=FALSE)

R functions are generally 'vectorized', which means each regular expression can be applied to the whole vector of strings at once:

str_count(pattern=regex$regexp[1], string=df$text)

or

sapply(regex$regexp, str_count, string=df$text)

For instance,

> sapply(regex$regexp, str_count, string=df$text)
     (?<!(no|not))[ ][oO][nN][eE] (?<!(no|not))[ ][fF][iI][vV][eE]
[1,]                            0                                1
[2,]                            0                                0

Likely this will scale linearly in both dimensions, but it will be much faster than outer() as length(df$text) increases, since outer() expands both arguments to the full length(df$text) * nrow(regex) grid before calling str_count().
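A rough way to check both the equivalence and the relative speed (a sketch; it assumes the microbenchmark package is installed, and simply repeats the toy data above to get a larger input):

```r
library(stringr)
library(microbenchmark)
# Inflate the toy example so the timing difference is visible
texts <- rep(c("one two three four five", "six seven eight nine ten"), 500)
pats  <- c("(?<!(no|not))[ ][oO][nN][eE]", "(?<!(no|not))[ ][fF][iI][vV][eE]")
# Both approaches yield the same counts matrix
stopifnot(identical(unname(outer(texts, pats, str_count)),
                    unname(sapply(pats, str_count, string = texts))))
microbenchmark(
  outer  = outer(texts, pats, str_count),
  sapply = sapply(pats, str_count, string = texts),
  times  = 10
)
```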

Martin Morgan