I have a large text vector I would like to search for a particular character or phrase. Regular expressions are taking forever. How do I search it quickly?

Sample data:

R <- 10^7
garbage <- replicate( R, paste0(sample(c(letters[1:5]," "),10,replace=TRUE),collapse="") )
Ari B. Friedman

2 Answers

If you do need regular expressions, you can generally get a performance increase over the default regular expression engine by using the PCRE library (by setting perl=TRUE). There are other performance tips in ?grep:

Performance considerations:

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and ‘fixed = TRUE’ faster still (especially when each pattern is matched only a few times).

If you are working in a single-byte locale and have marked UTF-8 strings that are representable in that locale, convert them first as just one UTF-8 string will force all the matching to be done in Unicode, which attracts a penalty of around 3x for the default POSIX 1003.2 mode.

If you can make use of ‘useBytes = TRUE’, the strings will not be checked before matching, and the actual matching will be faster. Often byte-based matching suffices in a UTF-8 locale since byte patterns of one character never match part of another.
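As a rough illustration of the options mentioned in that help text, here is a minimal sketch on a small made-up vector (the data and the pattern "a b" are assumptions for illustration, not the question's full data set). It only checks that the variants agree for a pattern with no metacharacters on ASCII input; timings will depend on your data and locale.

```r
# Small illustrative vector (hypothetical; far smaller than the question's data)
set.seed(1)
x <- replicate(1000, paste0(sample(c(letters[1:5], " "), 10, replace = TRUE),
                            collapse = ""))

idx_default <- grep("a b", x)                                  # default regex engine
idx_perl    <- grep("a b", x, perl = TRUE)                     # PCRE
idx_fixed   <- grep("a b", x, fixed = TRUE)                    # literal match
idx_bytes   <- grep("a b", x, fixed = TRUE, useBytes = TRUE)   # byte-wise literal match

# For a plain ASCII pattern without metacharacters, all four should agree
stopifnot(identical(idx_default, idx_perl),
          identical(idx_default, idx_fixed),
          identical(idx_default, idx_bytes))
```

Wrapping each call in system.time() or microbenchmark() on your own data is the simplest way to see which option pays off.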

Joshua Ulrich
  • Cool. Presumably there are tricks with factors that could work as well if you have lots of repeated vectors.... – Ari B. Friedman Oct 18 '13 at 23:17
  • The perl flag made a major positive performance impact. Nearly instantaneous in my case on a 43 MB data frame. Previously the time was around 2-3 seconds per. Since I was making an app that needs live time response, this was the way to go. – Shawn Oct 15 '18 at 03:53
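The idea raised in the first comment, exploiting repeated values, could look something like the sketch below. It assumes the vector contains many duplicates (the `vals` strings are hypothetical); the search then runs once per distinct string and the results are mapped back. It would not help much on the question's sample data, where nearly every string is distinct.

```r
set.seed(1)
vals <- c("a b", "abc", "d e", "eee")        # hypothetical distinct strings
x <- sample(vals, 1e5, replace = TRUE)       # long vector with many repeats

u   <- unique(x)                             # distinct values only
hit <- grepl(" ", u, fixed = TRUE)           # search each distinct value once
res <- hit[match(x, u)]                      # map the results back to every element

# Same answer as searching the full vector directly
stopifnot(identical(res, grepl(" ", x, fixed = TRUE)))
```

If the data is already stored as a factor, the same approach works by searching levels(f) and indexing the result with as.integer(f).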

There's no need for regular expressions here, and their power comes with a computational cost.

You can turn off regular expression parsing in any of R's regex functions (grep, grepl, sub, gsub, regexpr, gregexpr) by passing the fixed=TRUE argument. This gives a noticeable speedup:

library(microbenchmark)
m <- microbenchmark( 
    grep( " ", garbage, fixed=TRUE ),
    grep( " ", garbage )
)
m
Unit: milliseconds
                             expr       min        lq   median        uq      max neval
 grep(" ", garbage, fixed = TRUE)  491.5634  497.1309  499.109  503.3009 1128.643   100
               grep(" ", garbage) 1786.8500 1801.9837 1810.294 1825.2755 3620.346   100
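One thing to keep in mind when switching to fixed=TRUE is that the pattern is then taken literally, so results can differ from the regex version whenever the pattern contains metacharacters. A small sketch with a made-up three-element vector:

```r
x <- c("a.b", "acb", "a b")   # hypothetical example strings

regex_hits <- grep("a.b", x)                 # regex: "." matches any character
fixed_hits <- grep("a.b", x, fixed = TRUE)   # literal: needs an actual "."

regex_hits   # 1 2 3
fixed_hits   # 1
```

For a pattern like " " in the benchmark above there are no metacharacters, so both calls return the same indices and only the speed differs.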
Ari B. Friedman