Changing RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase to RegexOptions.Compiled yields the same results (since your pattern does not include any literal letters or ^/$).
On my machine this reduces the time taken on the sample document you linked from 46 seconds to 21 seconds (which still seems slow to me, but might be good enough for you).
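For reference, here is a rough sketch of what that change might look like. The FindEmails helper and the input string are placeholders of mine, not taken from your code, and the pattern is assumed to be the one from your question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CompiledOptionDemo
{
    // Assumed to be the pattern from the question, unchanged.
    private const string Pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Build the Regex once and reuse it; RegexOptions.Compiled pays a
    // one-time construction cost in exchange for faster matching.
    private static readonly Regex EmailRegex = new Regex(Pattern, RegexOptions.Compiled);

    // Hypothetical helper: returns every match found in the input.
    public static List<string> FindEmails(string input)
    {
        var results = new List<string>();
        foreach (Match m in EmailRegex.Matches(input))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Placeholder input; substitute the text of your document.
        string input = "Contact: alice@example.com, bob.smith@mail.example.org";
        foreach (string email in FindEmails(input))
        {
            Console.WriteLine(email);
        }
    }
}
```

How much this helps will depend on the input and the runtime, so it is worth timing it on your own document.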
EDIT: So I looked into this some more and have discovered the real issue.
The problem is with the first half of your regex: \w+([-+.]\w+)*@. This works fine when matching sections of the input that actually contain the @ symbol, but for sections that match just \w+([-+.]\w+)* and are not followed by @, the regex engine wastes a lot of time backtracking and then retrying from each later position in the same sequence (and failing again because there is still no @!).
You can fix this by forcing the match to start at a word boundary using \b:
string pattern = @"\b\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
On your sample document, this produces the same 10 results in under 1 second.
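If you want to see the effect for yourself, something like the following sketch can be used to compare the two patterns. The input here is synthetic (long runs of word characters with no @, plus two made-up addresses), not your sample document, and the Time helper is a name I made up; the actual timings will vary by machine and .NET version.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: runs the pattern over the input, prints the
    // elapsed time, and returns the matched strings.
    public static string[] Time(string label, string pattern, string input)
    {
        var regex = new Regex(pattern, RegexOptions.Compiled);
        var watch = Stopwatch.StartNew();
        // Matches() is evaluated lazily, so ToArray() forces the full scan
        // to happen while the stopwatch is running.
        string[] found = regex.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
        watch.Stop();
        Console.WriteLine($"{label}: {found.Length} matches in {watch.ElapsedMilliseconds} ms");
        return found;
    }

    public static void Main()
    {
        // Synthetic input (an assumption): 20 long "words" that can never
        // be followed by '@', then two addresses at the end.
        var sb = new StringBuilder();
        for (int i = 0; i < 20; i++)
        {
            sb.Append(new string('x', 2000)).Append(' ');
        }
        sb.Append("alice@example.com bob.smith@mail.example.org");
        string input = sb.ToString();

        string original = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
        string anchored = @"\b" + original;

        string[] withoutBoundary = Time("without \\b", original, input);
        string[] withBoundary = Time("with \\b", anchored, input);

        // Check that both patterns found the same addresses on this input.
        Console.WriteLine(withoutBoundary.SequenceEqual(withBoundary)
            ? "same matches"
            : "different matches");
    }
}
```

Without \b, the engine can start an attempt at every character inside each long run; with \b, attempts that would start in the middle of a word fail immediately.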