The worst input for a regular expression will vary from engine to engine. The same regex and string may take no time at all on one engine, but never finish on another.
Differences between engines
Engine Type
For certain engines, the "evilest" regex is still benign, running in linear time (or O(n*m)
time when both the length of the regex and the length of the string may vary.) Of course, the reason for this is the implementation. These engines don't backtrack; instead they use a finite state machine (FSM).
Note that some backtracking implementations build an FSM, but only as an intermediate step. Don't let this confuse you; they still match by backtracking, so they're not FSM engines in this sense.
Most of the older regex engines (like sed) use FSM matching. A few newer engines use this implementation as well, such as Go's regexp package. PCRE even has DFA functions (search for "DFA" here) that use this type of matching.
Another answer of mine also addresses the potential speed difference between the two implementations.
It would be wise to consider an FSM implementation if you are really worried about malicious input affecting the speed of your regex. Unfortunately, FSM matching is not as powerful as backtracking; it lacks support for some features, such as backreferences.
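To get a feel for why the engine type matters, here is a rough Java sketch (Java's `java.util.regex` is a backtracking engine) that times the classic evil pattern `(a+)+$` on progressively longer non-matching inputs. The pattern and lengths are just illustrations; on an FSM engine such as Go's `regexp` or PCRE's DFA functions the time would grow roughly linearly, but here it roughly doubles with every extra character:

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // Nested quantifiers plus an impossible-to-match tail: the classic
        // recipe for catastrophic backtracking on a backtracking engine.
        Pattern evil = Pattern.compile("(a+)+$");

        for (int n = 10; n <= 26; n += 4) {
            String input = "a".repeat(n) + "!"; // all a's, then one bad character
            long start = System.nanoTime();
            boolean matched = evil.matcher(input).matches();
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("n=" + n + " matched=" + matched + " took " + ms + " ms");
        }
    }
}
```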
Optimizations
Evil is actually a bit subjective. Something evil to one regex engine may not be evil to a different engine. An evil plot can be thwarted if the engine is optimized. Optimizations are particularly important to backtracking engines, given their potential exponential run time.
Short-circuiting
Under certain conditions, the engine may be able to quickly determine a match is impossible. While running the regex `a*b?a*x` against the string `aaaaaaaaaaaaaaaaaaaaaaaaaa`, Regex101 says:

> Your match failed outright. What this means is the engine, due to its internal optimizations, understood that your pattern would never match at any position, and thus did not even attempt to.
Keep in mind that Wikipedia says the regex is evil, especially when paired with that string.
Of course, the engine is smart enough not to need to backtrack to determine that the match can't work. It saw something pretty obvious: the regex needs an `x` in order to match, but no `x` is present in the string.
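If your engine doesn't do this for you, you can get the same effect with a cheap manual pre-check before running a potentially expensive pattern. A minimal Java sketch of that idea, using the pattern from above (the helper and its names are just an illustration):

```java
import java.util.regex.Pattern;

public class PreCheck {
    // The pattern from above: it can only ever match if an 'x' is present.
    private static final Pattern PATTERN = Pattern.compile("a*b?a*x");

    static boolean matches(String input) {
        // Cheap short-circuit: skip the regex entirely when the required
        // literal 'x' is absent, mimicking what an optimized engine does.
        if (input.indexOf('x') < 0) {
            return false;
        }
        return PATTERN.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("aaaaaaaaaaaaaaaaaaaaaaaaaa")); // false, regex never runs
        System.out.println(matches("aaabax"));                     // true
    }
}
```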
Modifiers
I mention this because you might not expect modifiers to be a factor in regex performance. But they are.
Even PCRE, one of the more optimized implementations, may take considerably more steps with both the `u` and `i` modifiers enabled. See my question here for more information about this. In the end, I figured out that only certain characters trigger this behavior.
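Java has analogous flags. To be clear, the observation above is specifically about PCRE's `u` and `i` modifiers; this sketch only shows how you would enable the rough Java equivalents (`CASE_INSENSITIVE` plus `UNICODE_CASE`) so you can compare step counts or timings yourself:

```java
import java.util.regex.Pattern;

public class ModifierDemo {
    public static void main(String[] args) {
        // Case-insensitive matching with ASCII-only case folding.
        Pattern ascii = Pattern.compile("é+", Pattern.CASE_INSENSITIVE);

        // Adding UNICODE_CASE folds case for non-ASCII characters too,
        // roughly analogous to turning on both i and u in PCRE.
        Pattern unicode = Pattern.compile("é+",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

        System.out.println(ascii.matcher("ÉÉÉ").matches());   // false: É isn't folded
        System.out.println(unicode.matcher("ÉÉÉ").matches()); // true
    }
}
```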
Analyzing Strings
String length
In general, a long string will be slower than a short string. In fact, if you find a string of length x that causes catastrophic backtracking, you can make it backtrack a bit more by increasing the length of the string.
Greedy vs. Lazy
Compare the speeds of these regexes:
- `.*b` on `aaaaaaaa...aaaaab`
- `.*?b` on `aaaaaaaa...aaaaab`
- `.*b` on `abaaaaaaa...aaaaa`
- `.*?b` on `abaaaaaaa...aaaaa`
Essentially, greedy matching is best when you think you will need to match a lot. Lazy matching is best when you need to match only a little.
Note that if you change the regex to `a*b` or `a*?b`, then the engine may optimize things considerably.
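Here is a rough Java sketch that times each of the four combinations above. The string lengths and single-run timing are arbitrary choices, and real numbers will vary by engine and version, but the relative ordering should illustrate the point:

```java
import java.util.regex.Pattern;

public class GreedyVsLazy {
    static long micros(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        long start = System.nanoTime();
        p.matcher(input).find();
        return (System.nanoTime() - start) / 1_000;
    }

    public static void main(String[] args) {
        String bAtEnd   = "a".repeat(100_000) + "b";  // aaaa...aaab
        String bUpFront = "ab" + "a".repeat(100_000); // abaaa...aaa

        // Greedy .* runs to the end of the string and then backs off until it
        // finds a 'b'; lazy .*? grows one character at a time from the left.
        System.out.println(".*b  on aaa...b  : " + micros(".*b",  bAtEnd)   + " µs");
        System.out.println(".*?b on aaa...b  : " + micros(".*?b", bAtEnd)   + " µs");
        System.out.println(".*b  on abaaa... : " + micros(".*b",  bUpFront) + " µs");
        System.out.println(".*?b on abaaa... : " + micros(".*?b", bUpFront) + " µs");
    }
}
```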
Brute force testing
There are several frameworks that are specifically designed to try to find vulnerabilities in your regexes. It may be worthwhile to try one out.
There's really only one thing I'll suggest if you want to try writing your own tester: it's not practical to try every possible character at every position, especially if you want to test long strings.
Instead, look at your regex to determine which characters you should test. If your regex is `(a+)+`, there are really only two things that go into the match: `a` and not-`a`. You could really just imagine that there are only two characters, `a` and `b` (aka not-`a`), when you generate your strings to brute force with.
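A minimal sketch of that idea in Java: enumerate every string over the reduced alphabet {a, b} up to some length, time the regex against each one, and report the worst input per length (the pattern under test, the length cap, and the timing method are all arbitrary choices):

```java
import java.util.regex.Pattern;

public class BruteForceTester {
    public static void main(String[] args) {
        Pattern regex = Pattern.compile("(a+)+$"); // pattern under test
        char[] alphabet = {'a', 'b'};              // 'b' stands in for "not a"
        int maxLength = 18;                        // arbitrary cap; 2^len strings per length

        for (int len = 1; len <= maxLength; len++) {
            String worstInput = "";
            long worstNanos = 0;

            // Each integer from 0 to 2^len - 1 encodes one string over {a, b}.
            for (long bits = 0; bits < (1L << len); bits++) {
                StringBuilder sb = new StringBuilder(len);
                for (int i = 0; i < len; i++) {
                    sb.append(alphabet[(int) ((bits >> i) & 1)]);
                }
                String candidate = sb.toString();

                long start = System.nanoTime();
                regex.matcher(candidate).find();
                long elapsed = System.nanoTime() - start;
                if (elapsed > worstNanos) {
                    worstNanos = elapsed;
                    worstInput = candidate;
                }
            }
            System.out.printf("len=%2d  worst input: %-18s  %d µs%n",
                    len, worstInput, worstNanos / 1_000);
        }
    }
}
```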
Setting timeouts
It would be fantastic to be able to ensure your regex finishes before the heat death of the universe, right? Some regex engines and host languages do have a way to set a timeout.
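In .NET, for example, you can set a default match timeout for every Regex created in an AppDomain (C# below); you can also pass a TimeSpan directly to an individual Regex constructor.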
```csharp
AppDomain domain = AppDomain.CurrentDomain;
// Set a timeout interval of 2 seconds.
domain.SetData("REGEX_DEFAULT_MATCH_TIMEOUT", TimeSpan.FromSeconds(2));
```
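Java's Pattern class has no built-in timeout, but a common workaround is to run the match on a separate thread and wrap the input in a CharSequence whose charAt() checks the thread's interrupt flag, so a runaway match can be aborted from outside. In the Java snippet below, pattern, input, and the InterruptibleCharSequence class are assumed to be defined elsewhere: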
```java
Runnable runnable = new Runnable() {
    @Override
    public void run() {
        long startTime = System.currentTimeMillis();
        Matcher interruptableMatcher =
                pattern.matcher(new InterruptibleCharSequence(input));
        interruptableMatcher.find(); // runs for a long time!
        System.out.println("Regex took:" + (System.currentTimeMillis() - startTime) + "ms");
    }
};

Thread thread = new Thread(runnable);
thread.start();
Thread.sleep(500);
thread.interrupt();
```
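PHP doesn't expose a regex-specific timeout, but you can cap how long the whole script is allowed to run with set_time_limit(). The signature and description below are quoted from the PHP manual: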
```php
bool set_time_limit ( int $seconds )
```

> Set the number of seconds a script is allowed to run. If this is reached, the script returns a fatal error. The default limit is 30 seconds or, if it exists, the `max_execution_time` value defined in the `php.ini`.
>
> When called, `set_time_limit()` restarts the timeout counter from zero. In other words, if the timeout is the default 30 seconds, and 25 seconds into script execution a call such as `set_time_limit(20)` is made, the script will run for a total of 45 seconds before timing out.
You might as well visit the link, since it's right on Stack Overflow.