2

This is a follow-up to that question. I've learned that finding overlapping regex matches in Python is not straight-forward, so decided to do an additional inquiry to see how Perl and Ruby stand up to this task.

I'd like to count the number of all possible matches of a regex against a certain string. And by "all" I mean that the result should take into account both overlapping and non-unique matches. Here are some examples:

  • a.*k should be matched twice in "akka"
  • "bbboob" tested against b.*o.*b should yield 6

As a reference, here's a Perl one-liner suggested by tchrist - it outputs the correct matches and their count:

() = "bbboobb" =~ /(b.*o.*b)(?{push @all, $1})(*FAIL)/g; printf "got %d matches: %s\n", scalar(@all), "@all";

The only problem with this is that it eats up too much resources for test cases where the resulting match count is in the order of millions or more. But I understand it is due to the fact that all the matches are first groupped and only counted afterwards. I'm looking for a resource-efficient solution that only returns the count.

Community
  • 1
  • 1
jankes
  • 207
  • 1
  • 7
  • If you have a regular expression in the computer science sense, this can easily be done with an NFA in O(RN) with R and N being the lengths of the regular expression and input strings). – Nabb Feb 18 '12 at 19:02
  • @Nabb But unless you use RE2 (which you actually can in Perl), you don’t get an NFA. You have a recursive backtracker. See the Russ Cox papers. – tchrist Feb 19 '12 at 01:32

1 Answers1

6

It looks like tchrist has done all the hard work. If storing the matches and counting them afterwards is eating too much resource, then you could just change the regex-embedded code to just count the matches:

my $count = 0;

"bbboobb" =~ /(b.*o.*b)(?{$count++})(*FAIL)/g;

print "got $count matches\n";
Community
  • 1
  • 1
zgpmax
  • 2,777
  • 15
  • 22
  • Maybe a non capturing group can help speed things up a little as well. Either `(?:b.*o.*b)` or the `/n` flag where supported. – Eily Jan 12 '18 at 16:35