
I am using the ModSecurity rules (https://github.com/SpiderLabs/owasp-modsecurity-crs) to sanitize user input data. I am seeing CPU spikes and delays when matching user input against the ModSecurity rule regular expressions. Overall there are 500+ regular expressions checking for different types of attacks (XSS, bad robots, generic and SQL). For each request, I go through all parameters and check them against all these 500 regular expressions, using Matcher.find. Some parameters fall into effectively infinite matching loops, which I tackled using the technique from:

Cancelling a long running regex match?
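In short, that technique wraps the input in a CharSequence whose charAt() checks a deadline, so a runaway match can be aborted. A minimal sketch (the class and exception names are my own):

    // A CharSequence whose charAt() aborts the match once a deadline passes.
    // Matcher reads every character through charAt(), so the check fires
    // often enough to cut off a catastrophically backtracking match.
    class TimeBoxedCharSequence implements CharSequence {
        static class MatchTimeoutException extends RuntimeException {}

        private final CharSequence inner;
        private final long deadlineNanos;

        static TimeBoxedCharSequence withTimeout(CharSequence inner, long timeoutMillis) {
            return new TimeBoxedCharSequence(inner,
                    System.nanoTime() + timeoutMillis * 1_000_000L);
        }

        private TimeBoxedCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            if (System.nanoTime() > deadlineNanos) {
                throw new MatchTimeoutException();
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new TimeBoxedCharSequence(inner.subSequence(start, end), deadlineNanos);
        }
    }

Usage, where `rule` is one of the compiled patterns and `value` is a request parameter:

    // give each rule at most 100 ms against this parameter value
    try {
        boolean hit = rule.matcher(TimeBoxedCharSequence.withTimeout(value, 100)).find();
    } catch (TimeBoxedCharSequence.MatchTimeoutException e) {
        // treat as "no match" and log the offending rule
    }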

Sanitizing a user request takes around 500 ms, and the CPU % shoots up. I analyzed this with VisualVM (visualvm.java.net) using my test suite runner.

CPU profile output: [screenshot]

Please help me reduce the CPU usage and the load average.

– kannanrbk
  • According to the screenshot, `checkPattern` was called 212148825 times totalling 6100774 ms; that makes roughly 0.03 ms per call. I don't see a performance problem there, and definitely no proof of 500 ms per invocation. – Aug 31 '13 at 22:07
    If there are particular patterns causing longer delays you should identify them and include them in your question. – Holger Sep 17 '13 at 17:15
  • @Holger The time taken is not the problem; my concern is only the load and CPU usage. I want to process the parameters in parallel, but when I did that my load average went above 4. I took a thread dump using `jstack -l`, found the threads consuming the most CPU using `top -H -b -p `, converted their ids to hex, and the threads consuming high CPU (50%) were in the runnable state at Matcher.find. – kannanrbk Sep 18 '13 at 19:05
  • @bharathi: That's still the same kind of problem. It's very likely that a lot of the patterns are cheap but some of them are expensive. You have to identify the expensive ones and concentrate on optimizing those specific patterns. By the way, `Matcher.find` invokes other methods which are not shown due to the profiler's default filtering rules. Changing these rules to allow looking inside the JDK's methods can shed light on where inside the matching most of the time (and hence CPU load) is spent. – Holger Sep 19 '13 at 07:53

6 Answers


If possible, compile your regexes once and keep them around, rather than repeatedly (implicitly) compiling them (especially inside a loop).
See java.util.regex - importance of Pattern.compile()? for more info.
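For instance (the rule here is a made-up stand-in, not an actual CRS rule, and `String.matches` shows the implicit recompilation trap):

    import java.util.regex.Pattern;

    class RuleChecker {
        // Compiled once when the class loads; Pattern is immutable and
        // safe to share across threads and requests.
        private static final Pattern XSS_SCRIPT_TAG =      // hypothetical rule
                Pattern.compile("(?i)<script");

        // Good: reuses the precompiled Pattern on every request.
        static boolean fastCheck(String input) {
            return XSS_SCRIPT_TAG.matcher(input).find();
        }

        // Bad: String.matches() recompiles the regex on every single call.
        static boolean slowCheck(String input) {
            return input.matches("(?i).*<script.*");
        }
    }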

– Edward

I suggest you look at this paper: "Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort"

There are better ways to do the matching you describe. Essentially you take the 500 patterns you want to match and compile them into a single suffix tree, which can very efficiently match an input against all the rules at once.

The paper explains that this approach was described as "Boyer-Moore Approach to Exact Set Matching" by Dan Gusfield.

Boyer-Moore is a well-known algorithm for string matching. The paper describes a variation of Boyer-Moore for set matching.
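The paper's variant is Boyer-Moore based; as a rough sketch of the same "match the whole set in one pass" idea, here is the better-known Aho-Corasick automaton over literal keywords. This is my substitution for illustration, not the paper's algorithm, and since ModSecurity rules are full regexes it would only cover their literal fragments:

    import java.util.*;

    // Matches a whole set of literal keywords in a single pass over the input.
    class AhoCorasick {
        private static class Node {
            final Map<Character, Node> next = new HashMap<>();
            Node fail;
            final List<String> hits = new ArrayList<>(); // keywords ending here
        }

        private final Node root = new Node();

        AhoCorasick(Collection<String> keywords) {
            // 1. build the trie of all keywords
            for (String kw : keywords) {
                Node n = root;
                for (char c : kw.toCharArray()) {
                    n = n.next.computeIfAbsent(c, k -> new Node());
                }
                n.hits.add(kw);
            }
            // 2. BFS to set failure links (longest proper suffix in the trie)
            Deque<Node> queue = new ArrayDeque<>();
            root.fail = root;
            for (Node child : root.next.values()) {
                child.fail = root;
                queue.add(child);
            }
            while (!queue.isEmpty()) {
                Node n = queue.poll();
                for (Map.Entry<Character, Node> e : n.next.entrySet()) {
                    Node child = e.getValue();
                    Node f = n.fail;
                    while (f != root && !f.next.containsKey(e.getKey())) {
                        f = f.fail;
                    }
                    Node target = f.next.get(e.getKey());
                    child.fail = (target != null && target != child) ? target : root;
                    child.hits.addAll(child.fail.hits); // inherit suffix matches
                    queue.add(child);
                }
            }
        }

        // Returns every keyword occurring anywhere in the input, in one pass.
        List<String> findAll(CharSequence input) {
            List<String> found = new ArrayList<>();
            Node n = root;
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                while (n != root && !n.next.containsKey(c)) {
                    n = n.fail;
                }
                n = n.next.getOrDefault(c, root);
                found.addAll(n.hits);
            }
            return found;
        }
    }

You would feed it the literal fragments of the rules, e.g. new AhoCorasick(Arrays.asList("<script", "select", "union")) with invented keywords, and then run the full regexes only for the rules whose fragments actually occur in the input.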

– Ryan

I think this is the root of your problem, not the regex performance per se:

"For each request, I go through all parameters and check against all these 500 regular expressions"

No matter how fast your regexes are, this is still plenty of work. I don't know how many parameters you have, but even if there are only a few, that's still thousands of regular-expression checks per request. That can kill your CPU.

Apart from the obvious things like improving your regex performance by precompiling and/or simplifying them, you can do the following things to reduce the amount of regex checking:

  1. Use positive validation of user input based on the parameter type. E.g. if some parameter must be a simple number, don't waste time checking whether it contains a malicious XML script; just check whether it matches [0-9]+ (or something similarly simple). If it does, it is OK and you can skip checking all 500 regexps (see the sketch after this list).

  2. Try to find simple regexps that can eliminate whole classes of attacks by looking for what your regexps have in common. E.g. if you've got 100 regexps checking for the existence of certain HTML tags, first check whether the content contains at least one HTML tag at all. If it doesn't, you immediately save checking 100 regexps.

  3. Cache results. Many parameters generated in webapps repeat themselves, so don't check the same content over and over again; just remember the final validation result. Be sure to limit the maximum size of the cache to avoid DoS attacks (the sketch below also shows this).
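A rough sketch combining points 1 and 3 (all names are invented, and runAllRules stands in for the existing 500-regex loop):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.regex.Pattern;

    class ParameterValidator {
        // point 1: cheap whitelist for parameters known to be numeric
        private static final Pattern DIGITS = Pattern.compile("[0-9]+");

        // point 3: bounded result cache, size-capped to resist DoS
        private static final int MAX_CACHE_SIZE = 10_000;
        private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

        boolean isSafe(String value, boolean mustBeNumeric) {
            if (mustBeNumeric) {
                // whitelisted shape: skip the 500 rules entirely
                return DIGITS.matcher(value).matches();
            }
            Boolean cached = cache.get(value);
            if (cached != null) {
                return cached; // this exact value was validated before
            }
            boolean safe = runAllRules(value);
            if (cache.size() < MAX_CACHE_SIZE) { // crude cap; an LRU map would be nicer
                cache.put(value, safe);
            }
            return safe;
        }

        private boolean runAllRules(String value) {
            return true; // placeholder for the existing 500-regex check
        }
    }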

Also note that negative validation is usually easy to bypass: someone just changes a few chars in their malicious code and your regexps won't match anymore. You'll have to keep growing your "database" of regexps to protect against new attacks. Positive validation (whitelisting) doesn't have this disadvantage and is much more effective.

– Piotr Kołaczkowski
  • Hi, I did the first two steps. The total execution time is now better than before; I gained around 60 ms. – kannanrbk Sep 24 '13 at 11:04

Avoid expressions with:

  • multiline matching
  • case insensitivity
  • etc.

Perhaps you can also consider grouping the regular expressions and applying a given group of regular expressions depending on the user input.

– pabloa98
  • This is plain wrong. Why not multiline / case insensitive? Regex matching in Java is based on an NFA + backtracking, so things like case sensitivity do not affect performance much. Much more important is to avoid backtracking, e.g. .* followed by an alternation (a|b|c). – Piotr Kołaczkowski Sep 24 '13 at 12:12
  • As you said, case sensitivity and multiline search do not affect the performance much. That is another way of saying that they do affect performance; depending on the performance requirements, that may or may not be relevant. If performance is required, backtracking must not be used. Never. – pabloa98 Sep 28 '13 at 21:42

If you have such a big number of regexes, you could group (at least some of) them using a trie (http://en.wikipedia.org/wiki/Trie).
The idea is that if you have, for example, regexes like /abc[0-9-]/, /abde/, /another example/, /.something else/ and /.I run out of ideas/, you can combine them into the single regex

 /a(?:b(?:c[0-9-]|de)|nother example)|.(?:I run out of ideas|something else)/

In this way the matcher has to run only once instead of five times, and you avoid a lot of backtracking, because the common starting parts are shared in the regex above.
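As a sketch in Java (assuming a plain yes/no answer is enough; see the first comment below for the case where you need to know which rule matched):

    import java.util.regex.Pattern;

    class CombinedRule {
        // the five example patterns merged so their common prefixes are shared
        private static final Pattern COMBINED = Pattern.compile(
                "a(?:b(?:c[0-9-]|de)|nother example)|.(?:I run out of ideas|something else)");

        // one pass over the input instead of five
        static boolean matches(String input) {
            return COMBINED.matcher(input).find();
        }
    }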

– davide
  • Hi davide, I can't group them, because I need the details of the matched rule (each ModSecurity rule has its own attributes). – kannanrbk Sep 24 '13 at 11:02
  • In principle, if the matched rules are just some among the 500, you could prepare *packages* of regexes, using the above procedure to make a big regex per package. When one of the big regexes finds a match, you can check the original rules that form the package. To make this method effective, you should cluster the rules that are more likely to appear together. I hope it's feasible. – davide Sep 24 '13 at 11:14
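A rough sketch of that two-stage idea (class and method names invented):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    class RulePackage {
        private final Pattern combined;    // the package's one big pre-filter regex
        private final List<Pattern> rules; // the original individual rules

        RulePackage(Pattern combined, List<Pattern> rules) {
            this.combined = combined;
            this.rules = rules;
        }

        // Pay for the per-rule scan (to identify which rules matched)
        // only when the cheap combined regex fires at all.
        List<Pattern> matchingRules(String input) {
            List<Pattern> hits = new ArrayList<>();
            if (combined.matcher(input).find()) {
                for (Pattern rule : rules) {
                    if (rule.matcher(input).find()) {
                        hits.add(rule);
                    }
                }
            }
            return hits;
        }
    }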

There must be a subset of problematic regexes among these 500. For example, a regex like this one

    import java.util.regex.Pattern;

    String s = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB";

    // the failed match triggers catastrophic backtracking in (A+)+
    Pattern.compile("(A+)+").matcher(s).matches();

will take years to complete.

So in your case I would log all the problematic regexes together with their problematic inputs (see the sketch below). Once these are found, you could manually rewrite those few problematic regexes and test them against the originals. A regex can always be rewritten as a simpler and more readable Java function.
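A sketch of such logging (the 50 ms threshold and names are arbitrary):

    import java.util.regex.Pattern;

    class TimedRuleCheck {
        private static final long SLOW_THRESHOLD_MS = 50; // arbitrary cutoff

        static boolean check(Pattern rule, String input) {
            long start = System.nanoTime();
            boolean matched = rule.matcher(input).find();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > SLOW_THRESHOLD_MS) {
                // log the offending rule together with the input that hurt it
                System.err.printf("slow rule (%d ms): %s on input %s%n",
                        elapsedMs, rule.pattern(), input);
            }
            return matched;
        }
    }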

Another option, though it would not resolve the problem above, is to use a faster (20x in some cases) but more limited regex library; such libraries are available in Maven Central.
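The answer's original link is lost here; as one example of this kind of library (my assumption, not necessarily the one meant), Google's RE2/J offers a linear-time engine behind a java.util.regex-like API, at the cost of dropping backreferences and lookaround:

    // assumes the com.google.re2j artifact from Maven Central
    import com.google.re2j.Matcher;
    import com.google.re2j.Pattern;

    class Re2Check {
        // RE2 matching is linear in the input length: the (A+)+ pattern from
        // the example above cannot blow up here.
        private static final Pattern RULE = Pattern.compile("(A+)+B");

        static boolean check(String input) {
            Matcher m = RULE.matcher(input);
            return m.find();
        }
    }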

– Andrey Chaschev