Some Unicode characters in a string make str_match much slower

Question

I'm using a regex to parse item identifier information into separate pieces. It generally works fine, but I notice that in a few cases when there are some unexpected unicode characters (I think this is the cause, but I'm not certain) it takes many orders of magnitude longer to parse the string. Here's an example: string1 contains the unicode character \u2028; string2 is unicode free.

library(stringr)

pattern.items <- "(\\d{1,4})\\s+((?:\\w+[_.]*\\w*)+)\\s+(\\d{1,2})\\s+(Green|Purple|Yellow).*$"

string1 <- "1663 scomp.crit1.evidence    8 Green  scomp.1          1 4. \u2028\u2028What are the similarities and differences between beetle and fly wings?                                     evidence"

string2 <- "1664 scomp.infer3.response   8 Green  scomp.3          3 5. If you see an insect flying at night, is it more likely to be a butterfly or a moth?                                    response"

system.time(
str_match(string1, pattern.items)
)

system.time(
str_match(string2, pattern.items)
)

Here are the timings:

system.time(
+ str_match(string1, pattern.items)
+ )
   user  system elapsed 
 44.583   0.000  44.590 
> 
> system.time(
+ str_match(string2, pattern.items)
+ )
   user  system elapsed 
  0.000   0.000   0.001 
>

I'm running code like this on a 2093-row data frame. I started it 3 hours ago and it still hasn't finished. Did I do something incorrect, or is the problem elsewhere?

There are nested quantifiers, see what happens when there is no match https://regex101.com/r/bWwieq/1 — The fourth bird, Feb 21 '22 at 12:10
Please check the more obvious catastrophic backtracking issue: does `"(\\d{1,4})\\s+(\\w+(?:[_.]\\w+)*)\\s+(\\d{1,2})\\s+(Green|Purple|Yellow).*"` work as expected? — Wiktor Stribiżew, Feb 21 '22 at 12:10
I tested the regex in regex101 and the unicode characters did not cause any problem. https://regex101.com/r/gITm9i/1 — Stuart, Feb 21 '22 at 12:13
My example in regex101 doesn't say anything about catastrophic backtracking. ?? It says 1 match, 35 steps, 0.6 ms — Stuart, Feb 21 '22 at 12:17
Thanks for your help Wiktor. However, three problems: 1) I don't understand what catastrophic backtracking is 2) I don't know how I would do to the regex to avoid it 3) I don't know what the unicode has to do with it — Stuart, Feb 21 '22 at 12:23
Uh oh. I actually looked at the str_match output (and not just at the timings) and I got all NA's. I guess I have to adjust my pattern. I'll try tomorrow. — Stuart, Feb 21 '22 at 12:30

Some Unicode characters in a string make str_match much slower

0 Answers0