I am looking for a way to make a regex.match() call time out on Android.

Background: I have an app that uses an IntentService to scrape HTML content and parse it with regex matchers. Sometimes the format of the HTML pages changes, and then the regex.match() operation makes my app hang.

I have tried this solution and also the following code I adapted from Google source:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import timber.log.Timber;

public class RegexUtils {

    public RegexUtils() {
    }

    public void test() {
        long millis = System.currentTimeMillis();
        Matcher matcher = createMatcherWithTimeout("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 10000);
        try {
            Timber.d("RegexUtils: %s", (matcher.find() ? "Matches found" : "No matches found"));
        } catch (RuntimeException e) {
            // Timber formats the message itself, so pass the elapsed time as an argument.
            Timber.w("RegexUtils: Operation timed out after %d milliseconds", System.currentTimeMillis() - millis);
        }
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, long timeoutMillis) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern, long timeoutMillis) {
        if (timeoutMillis < 0) {
            return regularExpressionPattern.matcher(stringToMatch);
        }
        TimeoutCharSequence charSequence = new TimeoutCharSequence(stringToMatch, timeoutMillis);
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutCharSequence implements CharSequence {
        // Absolute deadline (epoch millis) after which charAt() throws.
        private final long expireTime;
        private final CharSequence chars;
        TimeoutCharSequence(CharSequence chars, long timeout) {
            this.chars = chars;
            expireTime = System.currentTimeMillis() + timeout;
        }
        @Override
        public char charAt(int index) {
            if (System.currentTimeMillis() > expireTime) {
                throw new CharSequenceTimeoutException("TimeoutCharSequence was used after the expiration time.");
            }
            return chars.charAt(index);
        }
        @Override
        public int length() {
            return chars.length();
        }
        @Override
        public CharSequence subSequence(int start, int end) {
            return new TimeoutCharSequence(chars.subSequence(start, end), expireTime - System.currentTimeMillis());
        }
        @Override
        public String toString() {
            return chars.toString();
        }
        private static class CharSequenceTimeoutException extends RuntimeException {
            public CharSequenceTimeoutException(String message) {
                super(message);
            }
        }
    }
}

But with either approach, charAt() is never called, so the timeout exception is never thrown.
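
A minimal sketch for checking that observation: wrap the input in a counting `CharSequence` and see which methods the matcher actually touches. `CharAtProbe` and `Counting` are hypothetical names introduced here, not part of the original code. If `charAtCalls` stays at zero while `toStringCalls` goes up, the engine is presumably copying the input up front, in which case a `charAt()`-based timeout cannot fire.

```java
import java.util.regex.Pattern;

// Hypothetical diagnostic, not part of the original RegexUtils class.
final class CharAtProbe {

    /** Delegating CharSequence that counts how the regex engine reads it. */
    static final class Counting implements CharSequence {
        private final CharSequence delegate;
        int charAtCalls = 0;
        int toStringCalls = 0;

        Counting(CharSequence delegate) {
            this.delegate = delegate;
        }

        @Override
        public char charAt(int index) {
            charAtCalls++;
            return delegate.charAt(index);
        }

        @Override
        public int length() {
            return delegate.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return delegate.subSequence(start, end);
        }

        @Override
        public String toString() {
            toStringCalls++;
            return delegate.toString();
        }
    }

    /** Runs one find() over the wrapped input and returns the wrapper with its counters. */
    static Counting probe(String regex, String input) {
        Counting counted = new Counting(input);
        Pattern.compile(regex).matcher(counted).find();
        return counted;
    }

    public static void main(String[] args) {
        Counting c = probe("x+y", "xxxxy");
        System.out.println("charAt calls: " + c.charAtCalls
                + ", toString calls: " + c.toStringCalls);
    }
}
```

Running the same probe on a desktop JVM and on the device should show whether the two regex implementations read the input differently.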

Any ideas on how to solve this are highly appreciated!! Thanks.

rob
  • Use a proper parser or make your regex more efficient. Also look into something like https://stackoverflow.com/q/20500003/2191572 – MonkeyZeus Jul 06 '22 at 12:40
  • @MonkeyZeus I wish I could make the regex more efficient, but I've run into cases where the HTML just changed too significantly. Both my links show Java solutions for setting a timeout on a regex, and I just don't understand why these solutions aren't working for me. Is it a change in the way Android does regex matching that keeps charAt() from being called, or is my code wrong? – rob Jul 06 '22 at 13:23
  • I'm no Java expert but none of your code seems to invoke `charAt()`. Also, you did not share your regex so suggesting improvements is impossible. Any specific reason that you're avoiding parsers like Jsoup? Is it even your regex that's causing the issue? Are you sure that the site you're scraping isn't timing out? – MonkeyZeus Jul 06 '22 at 13:56
  • @MonkeyZeus Well, as my content to be matched is wrapped in a `TimeoutCharSequence`, which implements `CharSequence`, I was under the impression that `charAt()` is called when using a matcher, as this is implied by the 2 links I provided in my original post. I will look into the Jsoup option, but in connection with OAuth login etc. I fear it will get a bit messy. Therefore I would prefer a way to just time out my regex operation. For now, I will probably just use the `ExecutorService` way you suggested in your first comment; however, I feel there must be a cleaner way to solve this?! – rob Jul 06 '22 at 17:05
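
The `ExecutorService` approach mentioned in the last comment could look roughly like the sketch below. `RegexWithTimeout`, `Result` and `findWithTimeout` are hypothetical names introduced for illustration. One caveat, stated as an assumption about typical regex engines: `Future.cancel(true)` only interrupts the worker thread, and a running match usually does not check the interrupt flag, so the worker may keep burning CPU until the match finishes. The caller simply stops waiting for it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Hypothetical helper, a sketch rather than a drop-in replacement for RegexUtils.
final class RegexWithTimeout {

    enum Result { MATCH, NO_MATCH, TIMED_OUT }

    /** Runs pattern.matcher(input).find() on a worker thread and waits at most timeoutMillis. */
    static Result findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis) {
        // Daemon thread, so a runaway match cannot keep the process alive on its own.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "regex-worker");
            worker.setDaemon(true);
            return worker;
        });
        Callable<Boolean> task = () -> pattern.matcher(input).find();
        Future<Boolean> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS)
                    ? Result.MATCH
                    : Result.NO_MATCH;
        } catch (TimeoutException e) {
            return Result.TIMED_OUT;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the match as unfinished.
            Thread.currentThread().interrupt();
            return Result.TIMED_OUT;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Regex task failed", e.getCause());
        } finally {
            // No-op if the task already finished; otherwise only requests an interrupt.
            future.cancel(true);
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Pattern risky = Pattern.compile("(x+x+)+y");
        String input = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
        System.out.println(findWithTimeout(risky, input, 1000));
    }
}
```

Creating one executor per call keeps the sketch short. In an `IntentService` that parses many pages, a shared executor would probably be preferable, bearing in mind that a stuck worker blocks any later tasks queued behind it.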

0 Answers