I have this code:
matrix<-outer(df$text, df1$regexp, str_count)
df contains more than 1,000 texts of about 1,500 characters each, and df1 contains 500 negation regexes (negative lookbehinds) like
(?<!(no|not|n`t|n’t|neither|never|no one|nobody|none|nor|nothing|nowhere|hardly|barely|scarcely|unlikely|seldom|rarely))[ ][aA][bB][aA][nN][dD][oO][nN]
so the code runs for more than an hour.
How can I speed it up?
A reproducible example:
library(stringr)
df<-data.frame(names=c("text1","text2"), text=c("one two three four five","six seven eight nine ten"))
regex<-data.frame(names=c("1","2"), regexp=c("(?<!(no|not))[ ][oO][nN][eE]","(?<!(no|not))[ ][fF][iI][vV][eE]"))
matrix<-outer(df$text, as.character(regex$regexp), str_count)
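One thing I would try first (a sketch, not timed on the full data): loop over the patterns with sapply, so each regex is applied to the whole text vector in a single vectorized str_count call, instead of letting outer expand both arguments into full-length vectors. stringsAsFactors = FALSE avoids the factor-to-character conversions as well.

```r
library(stringr)

df <- data.frame(names = c("text1", "text2"),
                 text  = c("one two three four five", "six seven eight nine ten"),
                 stringsAsFactors = FALSE)
regex <- data.frame(names = c("1", "2"),
                    regexp = c("(?<!(no|not))[ ][oO][nN][eE]",
                               "(?<!(no|not))[ ][fF][iI][vV][eE]"),
                    stringsAsFactors = FALSE)

# One vectorized str_count() call per pattern: each regex is applied to
# the whole text vector at once, and sapply() binds the results into a
# matrix with one row per text and one column per pattern.
counts <- sapply(regex$regexp, function(p) str_count(df$text, p))
```

The result has the same shape as the outer() version (rows = texts, columns = patterns), so the rest of the pipeline should not need to change.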
I've tried running the code in parallel:
library(stringr)
library(parallel)
no_cores <- detectCores() - 1
df<-data.frame(names=c("text1","text2"), text=c("one two three four five","six seven eight nine ten"))
regex<-data.frame(names=c("1","2"), regexp=c("(?<!(no|not))[ ][oO][nN][eE]","(?<!(no|not))[ ][fF][iI][vV][eE]"))
cl <- makeCluster(no_cores)
matrix<-parSapply(cl,regex$regexp, str_count, string=df$text)
stopCluster(cl)
and now the code runs about 40% faster on my 4-core PC.
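A variant of the parallel version I would also test (a sketch under the assumption that stringr is installed on all workers): attach stringr explicitly on each worker with clusterEvalQ and ship the texts once with clusterExport, rather than relying on the serialized call to carry everything over.

```r
library(parallel)

df <- data.frame(names = c("text1", "text2"),
                 text  = c("one two three four five", "six seven eight nine ten"),
                 stringsAsFactors = FALSE)
regex <- data.frame(names = c("1", "2"),
                    regexp = c("(?<!(no|not))[ ][oO][nN][eE]",
                               "(?<!(no|not))[ ][fF][iI][vV][eE]"),
                    stringsAsFactors = FALSE)

cl <- makeCluster(detectCores() - 1)
# Attach stringr on every worker and export the text vector once,
# so each worker only receives one pattern at a time to count.
clusterEvalQ(cl, library(stringr))
clusterExport(cl, "df")
counts <- parSapply(cl, regex$regexp, function(p) str_count(df$text, p))
stopCluster(cl)
```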
I also rewrote all the regexes as Wiktor recommended, and the code now runs about 25% faster than the parallel version with the old regexes:
(?<!n(?:[`’]t|e(?:ither|ver)|o(?:t| one|body|ne|r|thing|where){0,1})|hardly|barely|scarcely|unlikely|seldom|rarely)[ ][aA][bB][aA][nN][dD][oO][nN]
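On top of that rewrite, the per-letter character classes like [aA][bB][aA]… can be collapsed into an inline case-insensitive group, which ICU's regex engine (used by stringr) supports; (?i:...) keeps the lookbehind case-sensitive while matching the word itself in any case. A small sketch comparing the two forms:

```r
library(stringr)

# The per-letter classes [aA][bB]... spell out case-insensitivity by hand;
# an inline (?i:...) group matches the same strings more compactly and
# leaves the negative lookbehind untouched.
pat_classes <- "(?<!(no|not))[ ][aA][bB][aA][nN][dD][oO][nN]"
pat_inline  <- "(?<!(no|not))[ ](?i:abandon)"

texts <- c("they abandon the plan", "do not Abandon it")
str_count(texts, pat_classes)  # counts per text, lookbehind blocks "not Abandon"
str_count(texts, pat_inline)   # same counts as the character-class version
```

Whether this is actually faster would need benchmarking on the real data, but it at least makes the 500 patterns much easier to maintain.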