12

I have already parsed a string up to index idx. My next parse step uses a Regexp. It needs to match the next part of the string, i.e. staring from position idx . How do I do this efficiently?

For example:

let myString = "<p>ONE</p><p>TWO</p>"
let idx

// some code not shown here parses the first paragraph
// and updates idx
idx = 10

// next parse step must continue from idx 
let myRegex = /<p>[^<]*<\/p>/
let subbed = myString.substring(idx)
let result = myRegex.exec(subbed)
console.log(result) // "<p>TWO</p>", not "<p>ONE</p>"

But myString.substring(idx) seems like a quite expensive operation.

Are there no regex operations like this: result = myRegex.execFromIndex(idx, myString);?

In general, I want to start regex matching from different indexes so I can exclude parts of the string and avoid matches that are already parsed. So one time it can be from myString[0] another time myString[51] and so on.

Is there a way to do this efficiently? I'm parsing hundreds of thousands of lines and want to do this in an as cheap way as possible.

Inigo
  • 12,186
  • 5
  • 41
  • 70
mottosson
  • 3,283
  • 4
  • 35
  • 73
  • 1
    typo `myString.length` – Pranav C Balan Jan 18 '17 at 15:45
  • 1
    if you really are concerned about efficiency, try not to use regex. – Faibbus Jan 18 '17 at 15:47
  • 4
    Construct the regex instance and then set its `.lastIndex` property. [Read the documentation.](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex) – Pointy Jan 18 '17 at 15:49
  • @Faibbus What do you suggest instead? I'm quite sure I won't be able to write a more efficient search than regex on my own. – mottosson Jan 18 '17 at 15:54
  • This question might get better answers if you would provide the logic by which you decide you need to search from a certain index only. – trincot Jan 18 '17 at 15:57
  • @trincot Thanks. Is it better now? – mottosson Jan 18 '17 at 16:05
  • Well, it doesn't help to get better answers. I was referring to potential code you might have that shows how you already parsed some part. It is likely that you could improve that code so it integrates the requirement you have here. – trincot Jan 18 '17 at 16:17
  • Though it's 5 years old, this is a good question that many might benefit from, so I updated your question, turning your example into a [mre] and reorged the wording a bit (got to the point quicker, moved the rest to after the example), and I wrote an answer that also gets to the point very clearly. You can execute your example and the solution in my answer to see for yourself. Hopefully this finally gets an accepted answer after 5 years. – Inigo Feb 09 '22 at 16:12

2 Answers2

8

A JavaScript Regexp has a lastIndex property that is used in Regexp.exec() as a placeholder that contains the index of the last match, show it knows where to start next. So setting myRegex.lastIndex = 3 should solve your problem.

It's more efficient than substring method because it doesn't need to create an extra variable and setting the lastIndex property is probably a quicker operation than doing a substring. Everything else is the exactly the same as you were doing.

Below is a test since that shows that setting lastIndex will produce the same result as doing the substring first.

var result1Elem = document.getElementById('result1');
var result2Elem = document.getElementById('result2');
var runBtn = document.getElementById('RunBtn');
runBtn.addEventListener("click", runTest);
function runTest() {
  var substrStart = +document.getElementById('substrStartText').value
  var myRegex1 = new RegExp(document.getElementById('regexText').value, 'g');
  myRegex1.lastIndex = substrStart;
  var myRegex2 = new RegExp(document.getElementById('regexText').value, 'g');

  var myString1 = document.getElementById('testText').value;
  var myString2 = myString1.substring(3);
  
  var result;
  
  var safety = 0;
  while ((result = myRegex1.exec(myString1)) !== null) {
    result1Elem.innerHTML += '<li>' + result[0] + ' at ' + result.index + '</li>';
    if (safety++ > 50) break;
  }
  
  safety = 0;
  while ((result = myRegex2.exec(myString2)) !== null) {
    result2Elem.innerHTML += '<li>' + result[0] + ' at ' + (result.index + substrStart)  + '</li>';
    if (safety++ > 50) break;
  }
}
<table>
<tr><td>Test </td><td> <input type="text" value="Hello World" id="testText" /></td></tr>
<tr><td>Regex </td><td> <input type="text" value="l." id="regexText" /></td></tr>
<tr><td>Substring Start </td><td> <input type="text" value="3" id="substrStartText" /></td></tr>
<tr><td colspan="2"><button id="RunBtn">Run</button></td></tr>
</table>

<table style="width:100%">
  <tr style="font-weight:bold; background:#ccc">
    <td>Results of Regex with lastIndex = 3</td>
    <td>Results of string substringged</td>
  </tr>
  <tr>
    <td><ul id="result1"></ul></td>
    <td><ul id="result2"></ul></td>
  </tr>
<table>
Daniel Gimenez
  • 18,530
  • 3
  • 50
  • 70
  • Thanks, but it's not quite what I need since I don't do any regex matches on the beginning of the row, and the lastIndex will be 0 – mottosson Jan 18 '17 at 17:56
  • I'm a little confused as to what's wrong. Doing a substring from index 51 and setting the lastIndex to 51 will produce the same results – Daniel Gimenez Jan 18 '17 at 18:01
  • Yes, but is it efficient? Since I'm parsing hundreds of thousands of lines I need it to be really fast and memory efficient. And my question was if there where a better alternative than using substring. – mottosson Jan 18 '17 at 18:02
  • 1
    weird thread this. It's a very valid question that hardly has any upvotes and this seems like a very good answer but not recognized as such by the person asking the question? What my question would be is if `lastIndex` is standardized? But it certainly seems that it is: https://tc39.es/ecma262/#sec-lastindex – Stijn de Witt Jan 09 '21 at 18:42
  • Wow, thanks @StijndeWitt for turning my attention to this question after so many years! A little bit cringe seeing my beginner comments :) I believe that the regex example in the answer will parse the entire string input and then move the lastIndex to the start position, but my ambition was to not have to parse the part before start to save some performance. And when using substring it will create a copy of the string which also costs performance. Just FYI, I now use Rust where I have full control and can use string slices to access any part of the string and which is zero copy :) – mottosson Jan 09 '21 at 22:12
  • @StijndeWitt, I was scratching my head as well due the oddness of this question-gem and unaccepted answer – DanielCuadra Feb 22 '22 at 21:43
  • @mottosson Rust is very cool, but your assumptions about how JS `substring` and `slice` work are wrong. See the second part of my answer. – Inigo Feb 23 '22 at 09:49
7

Use Regexp.exec and lastIndex

  1. Create a Regexp with the y or g flag
    • with the y flag, the match must start exactly at the specified start index
    • with the g flag, the match can occur anywhere after the specified index
  2. Set its lastIndex property to the start index
  3. Call exec

I've applied the above steps to your example code:

let myString = "<p>ONE</p><p>TWO</p>"
let idx

// some code not shown here parses the first paragraph
// and updates idx
idx = 10

// next parse step must continue from idx 
let myRegex = /<p>[^<]*<\/p>/y  // note the 'y' flag!
myRegex.lastIndex = idx
let result = myRegex.exec(myString)
console.log(result) // "<p>TWO</p>", not "<p>ONE</p>"

Another useful thing to know is that exec will update lastIndex to point to the position in the string after the returned match. This allows you to do many things, including:

  1. Rerun the same Regexp, which will automatically find the next match after that last match.
  2. Transfer the lastIndex value to a different Regexp, if the next thing you want to parse has a different pattern.
  3. Copy the lastIndex value into a variable used by your non-regex parsing.
  4. Return lastIndex to the caller of your function, so the caller can proceed with the rest of the string however it wants.

Why string.slice and substring are good solutions too

But myString.substring(idx) seems like a quite expensive operation.

Not necessarily so! Although they probably won't be as fast as Rust, all the leading Javascript engines (SpiderMonkey, V8, JavaScriptCore) do exactly what you describe for Rust. They optimize string.slice and substring behind the scenes, using pointers into the source string rather than making copies.

Adventures in the land of substrings and RegExps has a lot of great detail, pictures and analysis, but it is five years old and things have likely gotten even better since. There is the answer to this StackOverflow question: Is Javascript substring virtual?

Inigo
  • 12,186
  • 5
  • 41
  • 70