I'm using a regex to parse item identifier information into separate pieces. It generally works fine, but I notice that in a few cases when there are some unexpected unicode characters (I think this is the cause, but I'm not certain) it takes many orders of magnitude longer to parse the string. Here's an example: string1 contains the unicode character \u2028; string2 is unicode free.
library(stringr)
pattern.items <- "(\\d{1,4})\\s+((?:\\w+[_.]*\\w*)+)\\s+(\\d{1,2})\\s+(Green|Purple|Yellow).*$"
string1 <- "1663 scomp.crit1.evidence 8 Green scomp.1 1 4. \u2028\u2028What are the similarities and differences between beetle and fly wings? evidence"
string2 <- "1664 scomp.infer3.response 8 Green scomp.3 3 5. If you see an insect flying at night, is it more likely to be a butterfly or a moth? response"
system.time(
str_match(string1, pattern.items)
)
system.time(
str_match(string2, pattern.items)
)
Here are the timings:
system.time(
+ str_match(string1, pattern.items)
+ )
user system elapsed
44.583 0.000 44.590
>
> system.time(
+ str_match(string2, pattern.items)
+ )
user system elapsed
0.000 0.000 0.001
>
I'm running code like this on a 2093-row data frame. I started it 3 hours ago and it still hasn't finished. Did I do something incorrect, or is the problem elsewhere?