
I have this code:

matrix<-outer(df$text, df1$regexp, str_count)

where df has more than 1000 texts, each about 1500 characters long, and df1 has 500 negation regexes like

(?<!(no|not|n`t|n’t|neither|never|no one|nobody|none|nor|nothing|nowhere|hardly|barely|scarcely|unlikely|seldom|rarely))[ ][aA][bB][aA][nN][dD][oO][nN]

so my code runs for more than an hour.

How can I accelerate my code?

A reproducible example:

library(stringr)
df<-data.frame(names=c("text1","text2"), text=c("one two three four five","six seven eight nine ten"))
regex<-data.frame(names=c("1","2"), regexp=c("(?<!(no|not))[ ][oO][nN][eE]","(?<!(no|not))[ ][fF][iI][vV][eE]"))
matrix<-outer(df$text, as.character(regex$regexp), str_count)

I've tried running the code in parallel with

library(stringr)
library(parallel)
no_cores <- detectCores() - 1
df<-data.frame(names=c("text1","text2"), text=c("one two three four five","six seven eight nine ten"))
regex<-data.frame(names=c("1","2"), regexp=c("(?<!(no|not))[ ][oO][nN][eE]","(?<!(no|not))[ ][fF][iI][vV][eE]"))
cl <- makeCluster(no_cores)
matrix<-parSapply(cl,regex$regexp, str_count, string=df$text)
stopCluster(cl)

and now the code is about 40% faster on my 4-core PC.

I've changed all the regexes as Wiktor recommended, and the code now runs about 25% faster than the parallelized code with the old regexes:

(?<!n(?:[`’]t|e(?:ither|ver)|o(?:t| one|body|ne|r|thing|where){0,1})|hardly|barely|scarcely|unlikely|seldom|rarely)[ ][aA][bB][aA][nN][dD][oO][nN]
stack user

2 Answers


The regex flavor used in stringr is ICU (so it cannot be tested at regex101.com, which does not support that flavor), and this flavor does not require fully fixed-width lookbehinds. It supports a limiting quantifier inside a lookbehind, as well as regular * and + in some simple cases (though the latter two are more a bug than a feature and might get fixed later).
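For instance, a bounded quantifier inside a lookbehind compiles and runs fine in stringr (a small check; the sample strings are made up):

```r
library(stringr)
# ICU accepts a bounded quantifier inside the lookbehind;
# " one" is counted only when not directly preceded by "no" or "not".
str_count(c("no one", "a one", "not one"), "(?<!not{0,1})[ ]one")
```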

So, your regex is slow because several alternation branches start with the same substrings, which creates excessive backtracking. You need to make sure that no two branches can match at one and the same location.

Use

(?<!n(?:[`’]t|e(?:ither|ver)|o(?:t| one|body|ne|r|thing|where){0,1})|hardly|barely|scarcely|unlikely|seldom|rarely)[ ][aA][bB][aA][nN][dD][oO][nN]
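As a sanity check, the rewritten lookbehind should find exactly the same matches as the original; only the amount of backtracking changes. A quick comparison on a shortened variant of both patterns (the negation-word list is trimmed and the sample texts are made up):

```r
library(stringr)
# Shortened versions of both patterns: same four negation words,
# but the optimized one factors out the shared "n" prefix.
old_rx <- "(?<!(no|not|neither|never))[ ][aA][bB][aA][nN][dD][oO][nN]"
new_rx <- "(?<!n(?:e(?:ither|ver)|ot{0,1}))[ ][aA][bB][aA][nN][dD][oO][nN]"
x <- c("they abandon ship", "never abandon hope", "not abandon", "no abandon")
cbind(old = str_count(x, old_rx), new = str_count(x, new_rx))
```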
Wiktor Stribiżew

Create the data correctly up-front (character rather than factor):

df <- data.frame(names=c("text1","text2"),
                 text=c("one two three four five",
                        "six seven eight nine ten"),
                 stringsAsFactors=FALSE)

regex <- data.frame(names=c("1","2"), 
                    regexp=c("(?<!(no|not))[ ][oO][nN][eE]",
                             "(?<!(no|not))[ ][fF][iI][vV][eE]"),
                    stringsAsFactors=FALSE)

R functions are generally 'vectorized', which means each regular expression can be applied to the whole vector of strings at once:

str_count(pattern=regex$regexp[1], string=df$text)

or

sapply(regex$regexp, str_count, string=df$text)

For instance,

> sapply(regex$regexp, str_count, string=df$text)
     (?<!(no|not))[ ][oO][nN][eE] (?<!(no|not))[ ][fF][iI][vV][eE]
[1,]                            0                                1
[2,]                            0                                0

Likely this will scale linearly in both dimensions, but it will be much faster than outer() as length(df$text) increases, since outer() expands both arguments to the full length(df$text) * nrow(regex) grid before calling str_count().
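A rough way to check both the equivalence and the relative speed (a sketch; it assumes the microbenchmark package is installed, and simply repeats the toy data above to get a larger input):

```r
library(stringr)
library(microbenchmark)
# Inflate the toy example so the timing difference is visible
texts <- rep(c("one two three four five", "six seven eight nine ten"), 500)
pats  <- c("(?<!(no|not))[ ][oO][nN][eE]", "(?<!(no|not))[ ][fF][iI][vV][eE]")
# Both approaches yield the same counts matrix
stopifnot(identical(unname(outer(texts, pats, str_count)),
                    unname(sapply(pats, str_count, string = texts))))
microbenchmark(
  outer  = outer(texts, pats, str_count),
  sapply = sapply(pats, str_count, string = texts),
  times  = 10
)
```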

Martin Morgan