22

Say I'm running a service where users can submit a regex to search through lots of data. If the user submits a regex that is very slow (ie. takes minutes for Matcher.find() to return), I want a way to cancel that match. The only way I can think of doing this is to have another thread monitor how long a match is taking and use Thread.stop() to cancel it if necessary.

Member variables:

long REGEX_TIMEOUT = 30000L;
Object lock = new Object();
boolean finished = false;
Thread matcherThread;

Matcher thread:

try {
    matcherThread = Thread.currentThread();

    // imagine code to start monitor thread is here

    try {
        matched = matcher.find();
    } finally {
        synchronized (lock) {
            finished = true;
            lock.notifyAll();
        }
    }
} catch (ThreadDeath td) {
    // send angry message to client
    // handle error without rethrowing td
}

Monitor thread:

synchronized (lock) {
    while (! finished) {
        try {
            lock.wait(REGEX_TIMEOUT);

            if (! finished) {
                matcherThread.stop();
            }
        } catch (InterruptedException ex) {
            // ignore, top level method in dedicated thread, etc..
        }
    }
}

I've read java.sun.com/j2se/1.4.2/docs/guide/misc/threadPrimitiveDeprecation.html and I think this usage is safe since I'm controlling where ThreadDeath is thrown via synchronisation and handle it and the only damaged objects could be my Pattern and Matcher instances which will be discarded anyway. I think this breaks Thread.stop() because I'm not rethrowing the error, but I don't really want the thread to die, just abort the find() method.

I've managed to avoid using these deprecated API components so far, but Matcher.find() does not seem to be interruptible and can take a very long time to return. Is there any better way to do this?

7 Answers7

46

From Heritrix: (crawler.archive.org)

/**
 * CharSequence that noticed thread interrupts -- as might be necessary 
 * to recover from a loose regex on unexpected challenging input. 
 * 
 * @author gojomo
 */
public class InterruptibleCharSequence implements CharSequence {
    CharSequence inner;
    // public long counter = 0; 

    public InterruptibleCharSequence(CharSequence inner) {
        super();
        this.inner = inner;
    }

    public char charAt(int index) {
        if (Thread.interrupted()) { // clears flag if set
            throw new RuntimeException(new InterruptedException());
        }
        // counter++;
        return inner.charAt(index);
    }

    public int length() {
        return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

Wrap your CharSequence with this one and Thread interrupts will work ...

Kris
  • 14,426
  • 7
  • 55
  • 65
  • 1
    It would be slightly faster if you moved the exception bit out of charAt, although the real problem is likely to be inefficient patterns rather than large target text. – Tom Hawtin - tackline May 26 '09 at 14:05
  • 1
    All backtracking regex algorithms are well known to become pathalogical if certain evil patterns are specified. As Tom says, it's unlikely that a huge regex run time would be caused by reading the input string but much more likely because of excessive backtracking in the regex algorithm. It might be that the regex could get stuck for ages before reading another character (I've seen this before in perl) so the interupt might not work for some time. – Benj Apr 11 '11 at 10:01
  • @Kris charAt is never get called. What could be the problem? please, see my question here http://stackoverflow.com/questions/17674839/android-matcher-find-infinite – NullPointerException Jul 16 '13 at 11:18
  • I agree, this can't work. Regexp matcher copies the charsequence to a string. – Snicolas Jan 22 '18 at 21:54
  • @Benj current Java implementation of Matcher does actually call charAt again even when backtracking, even in small strings carAt is hit millions of times. Just performed a test. In 2,3 seconds using the string xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx and the pattern (x+x+)+y it called charAt 90 million times (without heating first. I asumme that testing in proper environment following the guidelines for microbenchmarking would execute it even more times in the given period). – DGoiko Jan 29 '20 at 18:45
6

With a little variation it is possible to avoid using additional threads for this:

public class RegularExpressionUtils {

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 2000);
        System.out.println(matcher.matches());
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, int timeoutMillis) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern, int timeoutMillis) {
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern());
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final int timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        public TimeoutRegexCharSequence(CharSequence inner, int timeoutMillis, String stringToMatch, String regularExpression) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
        }

        public char charAt(int index) {
            if (System.currentTimeMillis() > timeoutTime) {
                throw new RuntimeException("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                                + regularExpression + "' on input '" + stringToMatch + "'!");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch, regularExpression);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

Thanks a lot to dawce for pointing me to this solution in answer to an unnecessary complicated question !

Community
  • 1
  • 1
Andreas
  • 121
  • 1
  • 5
  • 1
    +1 Suggestion: `currentTimeMillis()` is a pretty expensive operation. Add a counter and only call it every Nth time `charAt()` is called. – Aaron Digulla Apr 07 '14 at 09:42
  • Great answer. Anyone using this will want to throw a custom exception rather than RuntimeException though. – Alkanshel Jun 15 '17 at 17:58
  • Yeah, indeed it was a great solution but limited to throw only RTE. If I want to send my custom exception I think we need to create one more wrapper on top of TimeoutRegexCharSequence class. – Codex Oct 16 '18 at 06:48
1

A long-running pattern matching process can be stopped using the below method.

  • Create StateFulCharSequence class which manages the state of pattern matching. When that state is changed, an exception is thrown on the next call to charAt method.
  • That state change can be scheduled using ScheduledExecutorService with a required timeout.
  • Here pattern matching is happening in the main thread and there is no need to check the thread interrupt state every time.

    public class TimedPatternMatcher {
    public static void main(String[] args) {
        ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1);
        Pattern pattern = Pattern.compile("some regex pattern");
        StateFulCharSequence stateFulCharSequence = new StateFulCharSequence("some character sequence");
        Matcher matcher = pattern.matcher(stateFulCharSequence);
        executorService.schedule(stateFulCharSequence, 10, TimeUnit.MILLISECONDS);
        try {
            boolean isMatched = matcher.find();
        }catch (Exception e) {
            e.printStackTrace();
        }
    
    }
    
    /*
    When this runnable is executed, it will set timeOut to true and pattern matching is stopped by throwing exception.
     */
    public static class StateFulCharSequence implements CharSequence, Runnable{
        private CharSequence inner;
    
        private boolean isTimedOut = false;
    
        public StateFulCharSequence(CharSequence inner) {
            super();
            this.inner = inner;
        }
    
        public char charAt(int index) {
            if (isTimedOut) {
                throw new RuntimeException(new TimeoutException("Pattern matching timeout occurs"));
            }
            return inner.charAt(index);
        }
    
        @Override
        public int length() {
            return inner.length();
        }
    
        @Override
        public CharSequence subSequence(int start, int end) {
            return new StateFulCharSequence(inner.subSequence(start, end));
        }
    
        @Override
        public String toString() {
            return inner.toString();
        }
    
        public void setTimedOut() {
            this.isTimedOut = true;
        }
    
        @Override
        public void run() {
            this.isTimedOut = true;
        }
    }}
    
prabhu
  • 21
  • 2
  • 3
  • This solution doesn't really solve the problem, what if the charAt takes a long time? so you cannot guarantee the timout defined –  Oct 24 '19 at 11:11
  • @lssilva charAt is called millions of times per second, even in short strings. Do your own tests if you don't trust me. – DGoiko Jan 24 '20 at 21:07
  • @DGoiko I just did a test, create a method that generates a random string (https://www.geeksforgeeks.org/generate-random-string-of-given-size-in-java/) and run the code:println(System.currentTimeMillis()) println(getAlphaNumericString(Int.MaxValue - 1000).charAt(900000)) println(System.currentTimeMillis()) it took almost 1min –  Jan 27 '20 at 07:19
  • @lssilva I meant reading chars foir regexp. Take code from https://stackoverflow.com/a/11348374/9465588 and set a counter in charAt to see how many times it is read in the desired amount of time. You can test any long-running regexp you can think of with it. – DGoiko Jan 28 '20 at 21:07
0

Maybe what you need is a new lib which implements the NFA algorithm.

The NFA algorithm is hundreds times faster than the algorithm which is used by Java standard library.

And the Java std lib is sensitive to the input regexp, which may make your problem happen -- some input make the CPU run for years.

And the timeout can be set by the NFA algorithm through the steps it uses. It is effective than the Thread solution. Trust me I use thread timeout to a relative problem, it is horrible for performance. I finally fix the problem by modify the main loop of my algorithm implement. I insert some check point to the main loop to test the time.

The detail can be found here: https://swtch.com/~rsc/regexp/regexp1.html .

wuranbo
  • 101
  • 7
0

I included a counter in order to check every n reads of charAt, in order to reduce the overhead.

Notes:

Some people stated that carAt may not be call frequently enough. I just added the foo variable in order to demostrate how much charAt is called, and that it is frequent enough. If you're going to use this in production, remove that counter, as it will decrease performance and end up overflowing long if ran in a server for long time. In this example, charAt is called 30 million times every 0.8 secs or so (not tested with proper microbenchmarking conditions, it is just a proof of concept). You can set a lower checkInterval if you want higher precission, at the cost of performance (System.currentTimeMillis() > timeoutTime is more expensive than the if clause on the long run.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.goikosoft.test.RegexpTimeoutException;

/**
 * Allows to create timeoutable regular expressions.
 *
 * Limitations: Can only throw RuntimeException. Decreases performance.
 *
 * Posted by Kris in stackoverflow.
 *
 * Modified by dgoiko to  ejecute timeout check only every n chars.
 * Now timeout < 0 means no timeout.
 *
 * @author Kris https://stackoverflow.com/a/910798/9465588
 *
 */
public class RegularExpressionUtils {

    public static long foo = 0;

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        long millis = System.currentTimeMillis();
        // This checkInterval produces a < 500 ms delay. Higher checkInterval will produce higher delays on timeout.
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 10000, 30000000);
        try {
            System.out.println(matcher.matches());
        } catch (RuntimeException e) {
            System.out.println("Operation timed out after " + (System.currentTimeMillis() - millis) + " milliseconds");
        }
        System.out.print(foo);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, long timeoutMillis,
                                                      int checkInterval) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis, checkInterval);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern,
                                                    long timeoutMillis, int checkInterval) {
        if (timeoutMillis < 0) {
            return regularExpressionPattern.matcher(stringToMatch);
        }
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern(), checkInterval);
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final long timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        private int checkInterval;

        private int attemps;

        TimeoutRegexCharSequence(CharSequence inner, long timeoutMillis, String stringToMatch,
                                  String regularExpression, int checkInterval) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
            this.checkInterval = checkInterval;
            this.attemps = 0;
        }

        public char charAt(int index) {
            if (this.attemps == this.checkInterval) {
                foo++;
                if (System.currentTimeMillis() > timeoutTime) {
                    throw new RegexpTimeoutException(regularExpression, stringToMatch, timeoutMillis);
                }
                this.attemps = 0;
            } else {
                this.attemps++;
            }

            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch,
                                                regularExpression, checkInterval);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

And the custom exception, so you can catch only THAT exception to avoid swalowing other RE Pattern / Matcher may throw.

public class RegexpTimeoutException extends RuntimeException {
    private static final long serialVersionUID = 6437153127902393756L;

    private final String regularExpression;

    private final String stringToMatch;

    private final long timeoutMillis;

    public RegexpTimeoutException() {
        super();
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message, Throwable cause) {
        super(message, cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message) {
        super(message);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(Throwable cause) {
        super(cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String regularExpression, String stringToMatch, long timeoutMillis) {
        super("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                + regularExpression + "' on input '" + stringToMatch + "'!");
        this.regularExpression = regularExpression;
        this.stringToMatch = stringToMatch;
        this.timeoutMillis = timeoutMillis;
    }

    public String getRegularExpression() {
        return regularExpression;
    }

    public String getStringToMatch() {
        return stringToMatch;
    }

    public long getTimeoutMillis() {
        return timeoutMillis;
    }

}

Based on Andreas' answer. Main credits should go for him and his source.

DGoiko
  • 346
  • 2
  • 19
0

What about checking the user-submitted regex for "evil" patterns prior to execution using one or more regex patterns (this could be in to form of a method called prior to conditional execution of the regex):

This regex:

\(.+\+\)[\+\*]

Will match:

(a+)+
(ab+)+
([a-zA-Z]+)*

This Regex:

\((.+)\|(\1\?|\1{2,})\)\+

Will match:

(a|aa)+
(a|a?)+

This Regex:

\(\.\*.\)\{\d{2,}\}

Will match:

(.*a){x} for x \> 10

I may be a bit naive wrt Regex and Regex DoS, but I can't help but think that a little pre-screening for known "evil" patterns would go a long way toward preventing issues at execution time, especially if the regex in question is an input provided by an end user. The patterns above are likely not refined enough, since I am far from an expert on regex. It is just food for thought, since everything else I have found out there seems to indicate it can't be done, and focuses on either putting a time-out on the regex engine, or limiting the number of iterations it is allowed to execute.

0

Another workaround would be to limit the region of the matcher, then call find(), repeating until the thread is interrupted or a match is found.

erickson
  • 265,237
  • 58
  • 395
  • 493
  • Only if you can guarantee the size and boundaries of your potential matches. That would probably not apply to the question, which is around arbitrary expressions. – AndrewF Jun 25 '20 at 20:50