I'm writing a simple script parser in JavaScript, and for the tokenizing part of the lexer I want to use a regular expression.

There are certain tokens I'm looking for, like (including the quotes):

  • "last-name"
  • "first-name"
  • "staff-id"

I also look for horizontal whitespace and vertical whitespace.

Finally, I look for whatever else is not matched by those tokens and whitespace.

The Regex would look something like:

("last-name"|"first-name"|"staff-id")|([\t ]+)|([\r\n]+)|(.+?)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^   ^^^
 tokens                                whitespace         catch-all

But I'm having a problem with the "catch-all" at the end, (.+?): each match of this last part captures only a single character.
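To make the symptom concrete, here is a small sketch that runs the regex above with matchAll() and lists every match. The helper name listMatches and the sample input are made up for illustration. Because .+? is lazy and nothing follows it in the pattern, it stops after the first character it can take.

```javascript
// The regex from the question; the last alternative is the lazy catch-all.
const lexerRegex = /("last-name"|"first-name"|"staff-id")|([\t ]+)|([\r\n]+)|(.+?)/g;

// Hypothetical helper: return the full text of every match, in order.
function listMatches(input) {
  return Array.from(input.matchAll(lexerRegex), (m) => m[0]);
}

console.log(listMatches('ab "staff-id"'));
// → ["a", "b", " ", "\"staff-id\""]
```

The known token is matched whole, while "ab" comes back as two separate one-character matches.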

What I wanted to do was capture everything together in that catch-all instead of one character at a time. I've Googled around and looked at Stack Overflow answers.

One workaround is to concatenate the "catch-all" results myself, one character at a time. For this particular project that's fine, but for another one I'd rather have a Regex solution that captures everything else in a single "catch-all" match, if that's even possible.
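For reference, the concatenation workaround could look roughly like the sketch below. The function name tokenize and the kind labels ("token", "hspace", "vspace", "other") are invented for this example and are not part of the original script.

```javascript
const pattern = /("last-name"|"first-name"|"staff-id")|([\t ]+)|([\r\n]+)|(.+?)/g;
// Hypothetical labels, one per capture group, in group order.
const kinds = ["token", "hspace", "vspace", "other"];

function tokenize(input) {
  const out = [];
  for (const m of input.matchAll(pattern)) {
    // Exactly one of the four groups participates in each match.
    const kind = kinds[m.slice(1).findIndex((g) => g !== undefined)];
    const last = out[out.length - 1];
    if (kind === "other" && last && last.kind === "other") {
      // Glue consecutive one-character catch-all matches together.
      last.text += m[0];
    } else {
      out.push({ kind, text: m[0] });
    }
  }
  return out;
}

console.log(tokenize('name "last-name"\nLucy'));
```

This keeps the tokens in source order, at the cost of rebuilding the catch-all text character by character.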

So how can I capture "everything else" that I haven't already matched in a Regex?

Jay Tennant
  • @anubhava You're right, that would catch everything, but for the lexer the order of the captures is important, and .replace() would lose this order. Is there a sequence you suggest to use .replace() that would maintain order? – Jay Tennant Apr 30 '21 at 20:56

1 Answer

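You could do string.split(/("last-name"|"first-name"|"staff-id")|([\t ]+)|([\r\n]+)/g). Because the regex contains capturing groups, split() includes the captured values in its result, so you get a sequence of [catch-all, ...groups, catch-all, ...] that starts and ends with a catch-all. A catch-all entry can be an empty string, and a group that did not take part in a match is undefined.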

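const test = `name "last-name"\n"first-name" Lucy`.split(/("last-name"|"first-name"|"staff-id")|([\t ]+)|([\r\n]+)/g);

// Layout of test: rest, group 1, group 2, group 3, rest, group 1, ...
// so i is always 4k (a "rest" entry) or 4k + 1 (start of the three groups).
for (let i = 0; i < test.length;) {
  if (i & 1) {
    // The three capture groups; the ones that did not match are undefined.
    console.log("match", {
      "tokens": test[i],
      "[\\t ]+": test[i + 1],
      "[\\r\\n]+": test[i + 2],
    });
    i += 3;
  } else {
    // Text between two matches (may be an empty string).
    console.log({
      "rest": test[i],
    });
    ++i;
  }
}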
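
Building on the loop above, here is one possible way to turn the split() output into an ordered token list for a lexer. This is only a sketch: the function name lex and the kind labels are made up, and it assumes the same three capture groups as the regex above.

```javascript
const separators = /("last-name"|"first-name"|"staff-id")|([\t ]+)|([\r\n]+)/;
// Hypothetical labels, one per capture group, in group order.
const groupKinds = ["token", "hspace", "vspace"];

function lex(input) {
  const parts = input.split(separators);
  const tokens = [];
  // parts is laid out as: rest, g1, g2, g3, rest, g1, g2, g3, ..., rest
  for (let i = 0; i < parts.length; i += 4) {
    // Skip empty "rest" entries between adjacent matches.
    if (parts[i] !== "") tokens.push({ kind: "other", text: parts[i] });
    for (let g = 0; g < groupKinds.length; g++) {
      const text = parts[i + 1 + g];
      // Groups that did not match (or lie past the end) are undefined.
      if (text !== undefined) tokens.push({ kind: groupKinds[g], text });
    }
  }
  return tokens;
}

console.log(lex(`name "last-name"\n"first-name" Lucy`));
```

The order of the source text is preserved, and each run of unmatched text arrives as one "other" entry rather than one character at a time.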
Thomas
  • Huh, that's really neat and simple! I think this coupled with @anubhava's comment nicely answers the question. Thanks! – Jay Tennant Apr 30 '21 at 21:05
  • Okay, actually this solves the entire problem on its own. I didn't realize .split() inserted captured regular expression values (including undefined) into the resultant array, and that's what you were showing. This function is great! – Jay Tennant Apr 30 '21 at 21:20
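
The split() behavior mentioned in the last comment can be seen in isolation with a tiny example. The regex and input here are invented purely for the demonstration; the first group never matches, so its slot is filled with undefined.

```javascript
// Two capture groups: (x) never matches in "a b", ( ) matches the space.
const demoParts = "a b".split(/(x)|( )/);

console.log(demoParts);
// → ["a", undefined, " ", "b"]
```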