
I have a regular expression that looks for email addresses (it was taken from another SO post that I can't find and has been tested against all kinds of email configurations; changing the regex itself isn't really my question, but I understand if it's the root cause):

/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

I'm using preg_match_all() in PHP.
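
Roughly how the pattern is being applied (a minimal sketch; the file path here is just a placeholder):

&lt;?php
// Minimal sketch of how the pattern is applied; 'page.html' is a placeholder.
$pattern = '/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i';
$html    = file_get_contents('page.html');

if (preg_match_all($pattern, $html, $matches) > 0) {
    // $matches[0] holds the full addresses that matched.
    print_r(array_unique($matches[0]));
}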

This works great for 99.99...% of the files I'm scanning and takes around 5 ms, but occasionally it takes a couple of minutes. These files are larger than the average webpage, at around 300 KB, but much larger files generally process fine. The only thing in the file contents that stands out is strings of thousands of consecutive "random" alphanumeric characters like this:

wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H...

Here are two pages causing the problem. View source to see the long strings.

Any thoughts on what is causing this?

--FINAL SOLUTION--

I tested various regexes suggested in the answers. @FailedDev's answer helped and dropped processing time from a few minutes to a few seconds. @hakre's answer solved the problem and reduced processing time to a few hundred milliseconds. Below is the final regex I used. It's @hakre's second suggestion.

/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
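
For anyone who wants to compare the old and new patterns on their own problem pages, a rough timing harness (the file path is a placeholder, and timings vary a lot from page to page):

&lt;?php
// Rough comparison of the original and the bounded/possessive patterns.
// 'problem-page.html' is a placeholder; timings vary greatly between pages.
$patterns = [
    'original' => '/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i',
    'final'    => '/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i',
];

$html = file_get_contents('problem-page.html');

foreach ($patterns as $label => $pattern) {
    $start = microtime(true);
    $count = preg_match_all($pattern, $html, $matches);
    printf("%-8s %8.1f ms  %s\n", $label, (microtime(true) - $start) * 1000,
        $count === false ? 'error ' . preg_last_error() : "$count matches");
}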
T. Brian Jones
    Can you convert your `+` to `++` so they don't backtrack? – ninjalj Nov 17 '11 at 23:07
    You're collecting email addresses by scraping them from pages...so you can spam people...and you want us to help? – Jonathan M Nov 17 '11 at 23:08
  • Can you clarify that comment @ninjalj? – T. Brian Jones Nov 17 '11 at 23:10
  • @mario: have you tested with case insensitivity? For me (PCRE 7.6) (ancient, I know), both Perl and pcregrep without case insensitivity are instantaneous, while `pcregrep -i` takes 5.5 s on the possessive quantifier case (28 s on the greedy case). Also, as expected, atomic greedy takes the same time as possessive. – ninjalj Nov 17 '11 at 23:39
    @ninjalj: Used `/i` and PCRE 8.12 (now dated too). Actually my previous test probably just failed (php.ini `pcre.backtrack_limit` likely). Ran it again. Using the possessive approach really didn't help much. It was indeed the reasonable quantifier `{1,50}` that had the best performance impact. – mario Nov 18 '11 at 00:11
  • @mario: same here: `{1,256}` goes down to 1.7s, `{1,256}+` goes to 0.4s. – ninjalj Nov 18 '11 at 00:26

2 Answers


You already know that your regex is causing an issue for large files. So maybe you can make it a bit smarter?

For example, you're using `+` to match one or more characters. Say you have a string of 10,000 characters: the regex engine may have to try up to 10,000 combinations to find the longest match. Now combine that with a similar quantifier: with a string of 20,000 characters and two `+` groups, how could they divide the input between them? Roughly 10,000 × 10,000 possibilities. And so on and so forth.

If you can limit the number of characters, do so (this looks a lot like you're matching email patterns): limit the address part to 256 characters and the domain name to 256 characters. Then there are "only" 256 × 256 possibilities to test:

/[a-z0-9_\-\+]{1,256}@[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i

That's probably already much faster. Then making those quantifiers possessive will reduce backtracking for PCRE:

/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

Which should speed it up again.
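
One practical note when benchmarking these variants in PHP: if a pattern blows through `pcre.backtrack_limit` (mentioned in the comments above), `preg_match_all()` returns `false` rather than raising an error, so it's worth checking `preg_last_error()`. A small sketch, with the file path as a placeholder:

&lt;?php
// Sketch: detect a silent PCRE failure (e.g. exceeding pcre.backtrack_limit)
// while testing the bounded/possessive pattern. 'page.html' is a placeholder.
$pattern = '/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i';
$html    = file_get_contents('page.html');

$count = preg_match_all($pattern, $html, $matches);

if ($count === false) {
    // PREG_BACKTRACK_LIMIT_ERROR means the backtrack limit was exceeded.
    echo 'PCRE error: ' . preg_last_error() . "\n";
} else {
    echo "$count matches\n";
}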

hakre

My best guess would be to try using possessive quantifiers, changing:

[a-z0-9_\-\+]+

to

[a-z0-9_\-\+]++

This should make the regex fail faster, so it may improve performance in these situations.

Edit:

Maybe atomic grouping could also help:

/(?>[a-z0-9_\-+]++)@(?>[a-z0-9\-]++\.)(?>[a-z]{2,3})(?:\.[a-z]{2})?/

You should first go with option one. It would be interesting to see whether option two makes any additional difference.
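
A toy comparison of the greedy and possessive variants on a long alphanumeric run with no `@` might look like the sketch below; it's illustrative only, since actual gains depend heavily on the PCRE version in use:

&lt;?php
// Toy comparison of greedy vs. possessive quantifiers on a long alphanumeric
// run that contains no '@'. Illustrative only; results depend on the PCRE version.
$subject = str_repeat('wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H', 500);

$variants = [
    'greedy'     => '/[a-z0-9_\-\+]+@[a-z0-9\-]+\.[a-z]{2,3}/i',
    'possessive' => '/[a-z0-9_\-\+]++@[a-z0-9\-]++\.[a-z]{2,3}/i',
];

foreach ($variants as $label => $re) {
    $start = microtime(true);
    $count = preg_match_all($re, $subject, $m);
    printf("%-10s %8.2f ms  %s\n", $label, (microtime(true) - $start) * 1000,
        $count === false ? 'error ' . preg_last_error() : "$count matches");
}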

FailedDev
  • Although there _are_ slight differences between atomic grouping and possessive quantifiers, you really only need to use one. Making all greedy quantifiers possessive should fix Brian's problem, is my guess. – Bart Kiers Nov 17 '11 at 23:21
  • @BartKiers Yeah I had this conflict myself. I will edit the post accordingly. Maybe the OP could post some benchmarks using both options. – FailedDev Nov 17 '11 at 23:23
  • `{2,3}` should be converted to `{2,3}+` to avoid potentially backtracking once, right? – ninjalj Nov 17 '11 at 23:26
    @ninjalj Well yes, but I would assume this plays a minor role into the general case. – FailedDev Nov 17 '11 at 23:28
  • @ninjalj, yes, it could also be made possessive, although that wouldn't make a lot of difference since it would never match more than 3 chars. It's more important to make the char classes with an `*` or `+` after it possessive. – Bart Kiers Nov 17 '11 at 23:29
  • While this helps (see my comment to the question), it seems the main problem is pcre performance with case insensitive matching. – ninjalj Nov 17 '11 at 23:43
  • @ninjalj Case insensitivity literally halves the possible matches right? But you can't really turn it off can you? – FailedDev Nov 17 '11 at 23:45
  • @FailedDev: that was in the ASCII case, Unicode would need to use locale-aware case-folding. And yes, I tried with `LC_ALL=C`. Also, you have it backwards, case-insensitive means it matches regardless of case. – ninjalj Nov 17 '11 at 23:47
  • I benchmarked various answers in the main question body. I gave rough estimates because it depends greatly on the page being processed. – T. Brian Jones Nov 18 '11 at 03:37