4

I'm parsing a moderately complex grammar using Javascript and I'd like to use regular expressions to match tokens such as numbers.

Given a string containing the grammar, a regular expression representing a number (say) and an offset within the string, I'd like to find out if the regular expression matches the string at that offset exactly.

I can set lastIndex, call RegExp.exec and check the index property of the resulting match to see if the match happened at the expected offset, but this is very inefficient because exec will search the entire string if it doesn't find a match at the starting offset.

The Javascript spec says "A Pattern evaluates ("compiles") to an internal procedure value. RegExp.prototype.exec can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String."

This is exactly what I want, but there seems to be no way of accessing this internal function. Does anyone know if there is?

P.S. I'm currently avoiding this problem by splitting the input string into an array of tokens, but I'd prefer not to.

arx
  • 16,686
  • 2
  • 44
  • 61
  • I'm not sure if I understand you correctly, but can't you instead use `str.substring(index)` and match the regexp on that to see whether it matched at index `0`? – pimvdb Oct 21 '11 at 17:46
  • @pimvdb - see my answer to Rob below – arx Oct 21 '11 at 17:56

3 Answers3

8

I've thoroughly tested possibly efficient methods, see JSPerf: ~20000 chars, ~1000000 chars. I've created a function to generate a random string consisting of alphanumeric characters. After running this function once, a RegExp pattern is created, to match a string with length 10 at a given offset.

Tested cases (when the condition in if(..) is true, the pattern is found at offset index):

var string = "...about 20000 characters: A-Z a-z 0-9...";
var regexp = /abcdef1324/g;
var regexpSubstr = /^abcdefg1234/;
var index = 10000;

/*1*/ if ( regexpSubstr.exec(string.substr(index,10)) ) ;
/*2*/ regexp.lastIndex = index;
      var found =  regexp.exec(string);
      if (found && found.length + index == regexp.lastIndex ) ;

/*3*/ if ( regexpSubstr.test(string.substr(index,10)) ) ;
/*4*/ // Only when the RegExp matches a fixed number of characters
      regexp.lastIndex = index;
      if ( regexp.test(string) && regexp.lastIndex == index + 10 ) ;

Case 1 and Case 3 are equivalent, because they're checking whether a substring matches the /^abcdef1234/ pattern (does the selected substring start with "abc..etc"?).

Case 2 and Case 4 use the .lastIndex method:
    1.   Set the .lastIndex property of the RegExp to the desired offset
    2.   Check whether the pattern is found.
    3.   Check whether a found pattern is positioned at offset index.
These methods require a Regular expression to have enabled the global flag.

At very large strings, method 4 (lastIndex + test) is proved to be most efficient whn a match at the offset occurs. Method 4, however, requires the matched pattern to have a pre-defined, fixed size.

Method 3 (substr + test) is slightly slower than 4, when a match occurs at the given position. When no match is found at a large string, however, method 3 is significantly faster than method 4. Method 1 and method 3 seems to be equally fast when no match is found.

RegExp methods
The .exec seems to not be more efficient than .test. The match method is not suitable for this case, since it attempts to find all matches, regardless of the .lastIndex property. Another possible RegExp function is the .search function, which is significantly slower for large strings, compared to previously shown methods.

Rob W
  • 341,306
  • 83
  • 791
  • 678
  • For preference, I'd like to scan the entire input in a single pass without creating lots of ad-hoc substrings. This is certainly possible, but I'm not sure if it's possible to do it efficiently in Javascript and use regular expressions. Though, actually, regular expressions also create lots of ad-hoc substrings, so perhaps this is the best solution. – arx Oct 21 '11 at 17:54
  • @arx If you know that your starting offset is `123354...`, it's more efficient to take the substring from position 123354...., rather than including offset 0 to 123354... in the RegExp search query. – Rob W Oct 21 '11 at 18:01
  • This isn't a problem. I can start the search at 123354 using the lastIndex property of the regular expression. I want to avoid searching the rest of the string if I don't get a match, a problem you also solve but at the cost of creating a new string. However, as I said above I'm beginning to think this is a reasonable price. – arx Oct 21 '11 at 18:05
  • @arx Updated answer with thoroughly tested cases. – Rob W Oct 22 '11 at 12:36
  • Thanks for the thorough analysis. However, I'm concerned about the case where the match doesn't happen. For example, if you use the recommended method 4 and it doesn't find a match it will scan through 1000000-index characters, which will certainly be slower than methods 1 or 3. Were you only testing cases where there was a match at the given offset? – arx Oct 22 '11 at 20:22
  • @arx When there's no match at a given position, method "**Case 3**" is significantly faster than method 4. I've updated my answer to include links to these test cases. – Rob W Oct 22 '11 at 20:54
  • Thank you. I've upvoted your answer for all the benchmarking but I've accepted sln's because it most closely answers the question. – arx Oct 25 '11 at 19:35
  • This is a good answer, but the comparison is a little unfair. AFAICS - the `lastIndex` approach searches through the whole string. On the other hand, the `substr` approach really just looks at the offset because of the caret in its regexp. I wonder if caret-matches would work with `lastIndex`, or there was a dedicated `match` method like in Python, would it be faster than `substr`. – Eli Bendersky Jun 21 '13 at 00:58
3

If you can set the search start position, you may be able to take advantage
of the end anchor in an assertion to terminate on a real non-match.

It requires that you accept the fact that the only time it wont match (on the surface)
is when that start position falls at the end of the string.

Then post process to check the length of the capture buffer. It depends on the regex,
but the likely fail is if the capture buffer is zero length.

Example:
(\d{6} | (?!$)) or ( (?:any subexpression) | (?!$) )

this will always match (unless the starting position is at the end of string, or
the string is empty).

Outcomes
- Does not match: The string is empty, or starting position is at the end of string.
- Matches: %99.999 of the time. If length of capture buffer is 0 (ie: ''), either the
left side of the alternation failed, or it passed by capturing nothing, regex dependent.

The start position is taken care of, however, static length matches in regex's are sometimes dificult (not impossible) to control with quantifiers. Its probably more
reasonable to use regex on substrings in the case of open ended quantifiers.

0

Perhaps new RegExp("^.{" + offset + "}whatever"), i.e. any character exactly offset times, followed by whatever? Needs benchmarking, but one would hope RegExp would optimize the "any character exactly offset times" part?

Szczepan Hołyszewski
  • 2,707
  • 2
  • 25
  • 39