3

I am using Apache regexp as library to validate regex. I want to know how to track if some regex is causing catastrophic backtracking. What I want to know is, is there some trick to catch which regex and the string value that causing catastrophic backtracking? I tried a little modify in class RE.java, but not as expected.

This my modification:

    public RE(String pattern) throws RESyntaxException
{
    this(pattern, MATCH_NORMAL);
    paramString = pattern;
}


public RE(String pattern, int matchFlags) throws RESyntaxException
{
    this(new RECompiler().compile(pattern), matchFlags);
    paramString = pattern;
}


int callcounterMN = 0;
protected int matchNodes(int firstNode, int lastNode, int idxStart)
{
    callcounterMN++;
    if (callcounterMN == 100) {
        try {
            String pc1 = new Exception().getStackTrace()[5].getClassName();
            if (pc1.indexOf("UpdateWebForm") > 1)     
                System.out.println("regex loop reach "+callcounterMN+"  with regex : "+paramString+" "+this.search.substring(0));
        } catch (Exception e) {}
    }
JSON C11
  • 11,272
  • 7
  • 78
  • 65
tukimin
  • 53
  • 6
  • Instead of trying to guess the string I'd set a timeout for each regex call that may cause CA issue. – Wiktor Stribiżew Jan 25 '16 at 06:59
  • yes, that is another way. but i still want to track this issue. in case something wrong with my regex. but, can you give me some reference for set a timeout for each regex? maybe it useful later. thanks – tukimin Jan 25 '16 at 07:31
  • See [*Cancelling a long running regex match?*](http://stackoverflow.com/questions/910740/cancelling-a-long-running-regex-match) – Wiktor Stribiżew Jan 25 '16 at 07:35
  • thanks for the reference. but still finding a way to track the CA issue. – tukimin Jan 25 '16 at 07:42

1 Answers1

1

Much later, but given that there is still no answer I might as well pitch in: Google's RE2 regex library aims to prevent catastrophic backtracking issues altogether, sometimes at the cost of some level of performance: https://github.com/google/re2/wiki/WhyRE2

This is not exactly an answer to your question, in that this is about accepting any regex and ensuring that it never causes catastrophic backtracking hangs, rather than detecting those that will using the Apache library, but hopefully it's still useful input to some proportion of visitors to this question. If you can afford to not support some regex pattern constructs, you can afford the performance hit in some cases, and you can afford to test and swap out the library you are using for this one - then you might have solved the problem.

Tao
  • 13,457
  • 7
  • 65
  • 76