8

Wondering about techniques for terminating long running regular expression matches (java matcher.find() method). Maybe subclassing Matcher and adding some logic to terminate after x number of iterations?

Basically I'm generating regular expressions using a genetic algorithm, so I don't have a lot of control over them. Then I test each one against some text to see if they match a certain target area of the text.

So since I'm sort of randomly generating these regular expressions, I get some crazy stuff going on, and it eats a ton of cpu and some find() calls take a while to terminate. I'd rather just kill them after a while, but not sure of best way to do that.

So if anyone has ideas, please let me know.

VLAZ
  • 26,331
  • 9
  • 49
  • 67
Fraggle
  • 8,607
  • 7
  • 54
  • 86

6 Answers6

3

There is a solution here which would solve your problem. (That question is the same problem yours is.)

Essentially, its a CharSequence that can notice thread interrupts.

The code from that answer:

/**
 * CharSequence that noticed thread interrupts -- as might be necessary 
 * to recover from a loose regex on unexpected challenging input. 
 * 
 * @author gojomo
 */
public class InterruptibleCharSequence implements CharSequence {
    CharSequence inner;
    // public long counter = 0; 

    public InterruptibleCharSequence(CharSequence inner) {
        super();
        this.inner = inner;
    }

    public char charAt(int index) {
        if (Thread.interrupted()) { // clears flag if set
            throw new RuntimeException(new InterruptedException());
        }
        // counter++;
        return inner.charAt(index);
    }

    public int length() {
        return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

Wrap your string with this and you can interrupt the thread.

Community
  • 1
  • 1
Reverend Gonzo
  • 39,701
  • 6
  • 59
  • 77
  • Looks promising. Easier than what I was about to try (replacing jdk Pattern class with my own version, and using Xbootclasspath to get it to load instead of the default). – Fraggle Aug 19 '11 at 19:56
  • However, some of the comments left in the other answer, lead me to think it may not always work. Worth a shot though. It makes the assumption that charAt will be called when we are stuck in these kinds of loops, which may not always be the case. – Fraggle Aug 19 '11 at 20:00
1

Just show another solution.

You can use the NFA algorithm which is not sensitive to input and hundreds of times faster than the Java standard library.

I think the sensitivity to input is the original reason which causes your problem.

You can check out the introduction here: Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

I also answered a similar question with more detail here: Cancelling a long running regex match?

Community
  • 1
  • 1
wuranbo
  • 101
  • 7
1

A worst case scenario and one which may have people yelling at me is:

You can run the regex matching in another thread and if its running too long you can thread.stop() it.

John Vint
  • 39,695
  • 7
  • 78
  • 108
  • Right, I am running in a separate thread. I'm pretty sure I have tried thread.stop() before and it caused some weird problems. Of course its deprecated and all and we are told not to use it, etc. Have you used thread.stop() successfully with recent jvms? – Fraggle Aug 19 '11 at 18:41
  • Well it was a while ago since I tried. I think the JVM just died or something with no stack trace or anything. Could have been due to something else since little evidence was left behind. I could always give it another shot. – Fraggle Aug 19 '11 at 19:24
0

One possible solution, which has a nice thing that it doesn't block the main thread, would be to spawn off the "matching" in a separate thread. You can create a customized Callable which returns null after the duration/threshold has expired or the "match" result if it is successful.

Sanjay T. Sharma
  • 22,857
  • 4
  • 59
  • 71
  • This is only valid if the Matcher responds to `Thread.intterupt()` If it does this is a better solution if it doesn't its not a valid solution. – John Vint Aug 19 '11 at 18:29
  • I am already running these in separate threads, but not sure how that helps, because unless I can stop the threads, they keep running in the background. @Sanjay, I don't think interrupt() will magically stop these runaway threads, but maybe I'm wrong. – Fraggle Aug 19 '11 at 18:38
  • Youre not wrong. Thread.interrupt is only as useful as the application's responsiveness to interruption – John Vint Aug 19 '11 at 18:42
  • You folks are of course right. It completely slipped my mind that Matcher methods don't respond to `Thread.interrupt`. As a last resort, you can try monkey patching "Matcher" class methods by checking for `Thread.interrupt` in every loop. – Sanjay T. Sharma Aug 19 '11 at 18:56
  • Yeah, I'm looking at a thread dump now, and its basically a stack of calls to all these match methods inside the Pattern class. So I'm seeing like Pattern$GroupHead.match, followed by Pattern$Ques.match, then Pattern$Curly.match(. So what I may have to do is subclass Pattern and at the start of every match method, insert some code to have them all return false if some time has expired or maybe call thread.wait()? – Fraggle Aug 19 '11 at 19:14
  • No, just check for Thread interruption and if it has been interrupted, propagate the interruption. Look [here](https://bitbucket.org/sanjayts/java-snippets/src/1dc970ebb770/src/main/net/sanjayts/snippets/threading/ThreadStopTest.java) for a sample snippet. – Sanjay T. Sharma Aug 19 '11 at 19:21
0

You need to use another thread and stop it when it runs out of time.

There are two ways of stopping: Thread#stop() and Thread#interrupt().

Using Thread.stop() is rather dangerous, and Matcher does not respond to Thread.interrupt (answering to interrupt is an opt-in behavior).

BUT there is a really clever solution, some details are here. Use the provided InterruptibleCharSequence (it wraps yours string and works almost like one, BUT it adds support for Thread#interrupt()), then build your own Callable returning whatever matcher returns. Each runnable can be now executed using a FutureTask / ThreadPool combo, and you can get the result with any timeout you desire:

Boolean result = myMatchingTask().get(2, TimeUnit.SECONDS)

If you are in Java EE environment, you can skip the complicated part, just use the InterruptipleCharSequence and @Asynchronous calls.

If this sounds cryptic, ask for details.

Community
  • 1
  • 1
fdreger
  • 12,264
  • 1
  • 36
  • 42
-1

If I were you, I would make my own class that I would put between my application and the library you're using to match, and implement methods like "interrupt" that you need to kill the thread, and manage the matching that way.

Nick
  • 51
  • 4