I tried to implement a simple property-path tokenizer, so that the resulting tokens can later be resolved quickly.

Here's my initial implementation:

function tokenize(path: string): (string | number)[] {
    const res = [], reg = /\[\s*(\d+)|["']([^"']+)["']\s*]|[a-z_$0-9]+/gi;
    let a;
    while (a = reg.exec(path)) {
        // group 1 = bracketed index, group 2 = quoted name, otherwise the whole match
        res.push(a[1] ? parseInt(a[1]) : a[2] || a[0]);
    }
    return res;
}

It can take an input like this: first.a.b[123].c['prop1'].d["prop2"].last, and produce the following fast-resolution array:

['first', 'a', 'b', 123, 'c', 'prop1', 'd', 'prop2', 'last']
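
For context, the point of producing this array is that resolving a value then becomes a plain loop over the tokens. A minimal sketch of that step (the resolve helper here is just an illustration, not my actual code):

function resolve(obj: any, tokens: (string | number)[]): any {
    let v = obj;
    for (const t of tokens) {
        v = v?.[t]; // each token is a direct property access, no re-parsing
    }
    return v;
}

// resolve({b: [null, {c: 42}]}, ['b', 1, 'c']) => 42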

The problem I'm having is with adding support for nested quotes (' and ") for an input like this: first["a\'b"].second['"'].

More precisely, I cannot figure out how to take one of the existing solutions for matching quoted strings and inject it into my regex. Those solutions work fine on their own, just not as part of my expression; joining the two into one is what I'm stuck on.
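
For example, the current version mis-tokenizes that input, because the quote characters inside the strings break the token boundaries:

tokenize(`first["a'b"].second['"']`); // => ['first', 'a', 'b', 'second']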

vitaly-t
  • Time to stop using regex. This is exactly what regexes are _not_ for; write (or find) a normal tokenizer that understands the syntax of the language you're working with, and can deal with nesting, inline block comments, etc. And remember that you're most definitely not the first to need this: find the libraries that others have already written to do this job for you, so you don't have to spend time rolling your own system and constantly updating it for edge cases as they come up. Look for a "js tokenizer" and you'll find lots of options – Mike 'Pomax' Kamermans Jan 15 '21 at 04:18
  • @Mike'Pomax'Kamermans Thanks, but I like RegEx, and I like writing my own tokenizers, if I can avoid bringing in extra dependencies. – vitaly-t Jan 15 '21 at 04:28

1 Answer

Match and capture the opening quote with a character set. Then you can repeat any character except the captured quote, by putting a negative lookahead with a backreference inside a quantified group, and finally match the quote again.
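
In isolation, with the quote captured as group 1 (in the full expression below it becomes group 2), that looks like this:

const quoted = /(["'])((?:(?!\1).)*)\1/;
console.log(`x["a'b"]y`.match(quoted)?.[2]); // a'b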

If you also need to handle a backslash-escaped delimiter inside the quotes, add an alternation that consumes any escaped character before trying the non-delimiter branch. The group will then repeatedly match either:

  • Any escaped character, or
  • Any character which is not the captured delimiter
(?:\\.|(?!\2).)*
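
Tried on its own (quote as group 1 again), the escape-aware version keeps the backslash in the capture:

const quotedEsc = /(["'])((?:\\.|(?!\1).)*)\1/;
console.log(String.raw`"one\"two"`.match(quotedEsc)?.[2]); // one\"two

Plugged into your full expression, with the quote as group 2: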

function normalize(path) {
    const res = [], reg = /\[\s*(\d+)(?=\s*\])|\[(["'])((?:\\.|(?!\2).)*)\2\]|[\w$]+/gi;
    let a;
    while (a = reg.exec(path)) {
        // group 1 = bracketed index, group 3 = quoted name, otherwise the whole match
        res.push(a[1] ? parseInt(a[1]) : a[3] || a[0]);
    }
    return res;
}
console.log(normalize(`first.a.b[123].c['prop1'].d["prop2"].last`));
console.log(normalize(`first["a'b"].second['"']`)); // quote inside the other quote type
console.log(normalize(`["one\\"two"]`)); // doubled backslash: the runtime string is ["one\"two"]
console.log(normalize(`['one\\'two']`)); // the runtime string is ['one\'two']

But I'd suggest using a true parser like Acorn instead if at all possible.
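
For reference, a rough sketch of what that could look like, walking the MemberExpression chain that Acorn's parseExpressionAt returns (this assumes plain identifier/number/string segments and is not a drop-in replacement):

import * as acorn from "acorn";

function tokenizeViaAst(path: string): (string | number)[] {
    const tokens: (string | number)[] = [];
    // parse the whole path as one expression, typically a MemberExpression chain
    let node: any = acorn.parseExpressionAt(path, 0, { ecmaVersion: 2020 });
    while (node.type === "MemberExpression") {
        // computed segments (b[123], c['prop1']) are Literals; dotted ones are Identifiers
        tokens.unshift(node.computed ? node.property.value : node.property.name);
        node = node.object;
    }
    tokens.unshift(node.name); // the root identifier
    return tokens;
}

console.log(tokenizeViaAst(`first.a.b[123].c['prop1']`)); // ['first', 'a', 'b', 123, 'c', 'prop1']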

CertainPerformance
  • I'm still not sure what all this satanic magic is about, but it works! Thank you! Definitely beats bringing `Acorn`-s to the table over such a small piece of code :) – vitaly-t Jan 15 '21 at 04:27
  • One issue... while it works for `["one'two"]` and `['one"two']`, it doesn't work for `["one\"two"]` or `['one\'two']`. Any idea how to fix that? – vitaly-t Jan 15 '21 at 06:40
  • You'll need to *either* match any escaped character, *or* anything but the delimiter, see edit – CertainPerformance Jan 15 '21 at 14:32
  • Thank you for helping with my little tokenizer! It works perfectly now, with so few lines of code. I prefer it much more to bringing in a large library like `Acorn`. I used it in my high-performance path-to-value module [path-value](https://github.com/vitaly-t/path-value). – vitaly-t Jan 16 '21 at 06:45