11

Is there a way to stop a runaway regular expression?

I am not interested in suggestions on how to modify it. I know it can be modified so it doesn't break, etc, but I am running a single regex against thousands of inputs, so modifying it means I need to retest it on all the inputs. Not very practical.

So the exact question is: is there some form of timer that I can use to terminate a regex that takes longer than X seconds to complete?

DavidO
  • 13,812
  • 3
  • 38
  • 66
user3688176
  • 327
  • 3
  • 13
  • 1
    You can't actually alarm out of the regex engine, from what my own testing has revealed to me. However, there is a module: Sys::SigAction which has a function, "`timeout_call`" that will interrupt the regex engine whilst in progress. However, interrupting the RE engine in progress is not safe, and can cause a seg-fault. – DavidO May 29 '14 at 16:19
  • @DavidO then simple fork should do. – mpapec May 29 '14 at 16:47
  • @mpapec ...by some definitions of "simple". ;) Yes, `fork` is sometimes a good alternative. I'll update my answer below to include it as a suggestion. – DavidO May 29 '14 at 16:48
  • @mpapec : No, I think the fork/alarm suggestion is very reasonable. It's too bad this question was on its way to being closed (hopefully that won't succeed), because it's not a bad question. The use-case may be of dubious merit, but there are legitimate cases where time-constrained matching is reasonable. BTW: I think threads have been deprecated in Perl-5.20. – DavidO May 29 '14 at 16:55
  • @DavidO re: `threads have been deprecated in Perl-5.20` world is coming to the end. :) – mpapec May 29 '14 at 16:59
  • @mpapec: "deprecated" was too strong of a word. Here's the quote from `perldelta` for 5.20: "*Interpreter-based threads are now discouraged...*" – DavidO May 29 '14 at 17:02
  • ###See also > [How can I safely validate an untrusted regex in Perl?](http://stackoverflow.com/questions/20357755/how-can-i-safely-validate-an-untrusted-regex-in-perl/20357964) – Mike Mestnik May 19 '15 at 00:32

1 Answers1

10

Perl's built-in alarm is insufficient for breaking out of a long running regular expression because Perl doesn't give opportunities for alarm timeouts inside of internal opcodes. alarm simply cannot penetrate it.

In some cases the most obvious solution is to fork a subprocess and time it out after it's gone on too long using alarm. This PerlMonks post demonstrates how to time-out a forked process: Re: Timeout on script

There is a Perl module on CPAN called Sys::SigAction that has a function called timeout_call, which will interrupt a long-running regular expression using unsafe signals. However, the RE engine wasn't designed to be interrupted, and can be left in an unstable state that may lead to seg-faults about 10% of the time.

Here is some example code that demonstrates Sys::SigAction successfully breaking out of the regex engine, as well as demonstrating that Perl's alarm is incapable of doing so:

use Sys::SigAction 'timeout_call';
use Time::HiRes;


sub run_re {
  my $string = ('a' x 64 ) . 'b';

  if( $string =~ m/(a*a*a*a*a*a*a*a*a*a*a*a*)*[^Bb]$/ ) {
    print "Whoops!\n";
  }
  else {
    print "Ok!\n";
  }
}

print "Sys::SigAction::timeout_call:\n";
my $t = time();
timeout_call(2,\&run_re);
print time() - $t, " seconds.\n";

print "alarm:\n";
$t = time();

eval {
  local $SIG{ALRM} = sub { die "alarm\n" };
  alarm 2;
  run_re();
  alarm 0;
};

if( $@ ) {
  die unless $@ eq "alarm\n";
}
else {
  print time() - $t, " seconds.\n";
}

The output will be something along the lines of:

$ ./mytest.pl
Sys::SigAction::timeout_call:
Complex regular subexpression recursion limit (32766) exceeded at ./mytest.pl line 11.
2 seconds.
alarm:
Complex regular subexpression recursion limit (32766) exceeded at ./mytest.pl line 11.
^C

You will notice that in the second call -- the one that is supposed to time out with alarm, I finally had to ctrl-C out of it because alarm was inadequate for breaking out of the RE engine.

The big caveat with Sys::SigAction is that even though it is capable of breaking out of a long-running regular expression, because the RE engine wasn't designed for such interruptions, the entire process can become unstable, leading to a segfault. While it doesn't happen every time, it can happen. This probably isn't what you want.

I don't know what your regular expression looks like, but if it fits within the syntax allowed by the RE2 engine, you can use the Perl module, re::engine::RE2 to work with the RE2 C++ library. This engine guarantees linear time searches, though it provides less powerful semantics than Perl's built-in engine. The RE2 approach avoids the whole issue in the first place by providing a linear-time assurance.

However, if you're unable to use RE2 (possibly because your regex's semantics are too demanding for it), the fork/alarm method is probably the safest way to assure you remain in control.

(By the way, this question, and a version of my answer were crossposted to PerlMonks.)

DavidO
  • 13,812
  • 3
  • 38
  • 66