
I am trying to use VS Code's tokenization engine for grammar injections and I don't understand why some regular expressions fail.

For example, suppose I have the following text.

VS Code, TextMate grammars, and Oniguruma regular expressions. 

I want to match Oniguruma using the following regex (see demo):

(?=and\s+(Oniguruma)\s+regular)

Based on the demo, the regular expression seems to match (or rather capture) what I want, as shown below.

[Screenshot: demo matching]

However, when I try this in a VS Code grammar, it fails. More specifically, the ./syntaxes/some.test.injection.json file contains:

{
    "scopeName": "some.test.injection",
    "injectionSelector": "L:text.html.markdown",
    "patterns": [
        { "include": "#test" }
    ],
    "repository": {
        "test": {
            "match": "(?=and\\s+(Oniguruma)\\s+regular)",
            "captures": {
                "1": { "name" : "some.test" }
            }
        }
    }
}

Then, in package.json I have:

{
    // ...
    "contributes": {
        "grammars": [
            {
                "scopeName": "some.test.injection",
                "path": "./syntaxes/some.test.injection.json",
                "injectTo": ["text.html.markdown"]
            }
        ]
    },
    // ...
}

Finally, the token color rule in settings.json looks like this:

{
    "editor.tokenColorCustomizations": {
        "textMateRules": [
            { "scope": "some.test", "settings": { "foreground": "#dfd43b" } }
        ]
    }
}

As you can see below, the token is not parsed:

[Screenshot: the token is not highlighted]

However, the token gets parsed when I use the following regex (see demo) instead:

(?<=and\s)(Oniguruma)(?=\s+regular)

As seen during the inspection of the editor token and scopes:

[Screenshot: editor token and scope inspection]

From the VS Code documentation (quoted below), I understand that I need to use Oniguruma regular expressions:

TextMate grammars rely on Oniguruma regular expressions and are typically written as a plist or JSON. You can find a good introduction to TextMate grammars here, and you can take a look at existing TextMate grammars to learn more about how they work.

My question is twofold:

  1. Why does the first expression fail? Is it not a valid Oniguruma regular expression?
  2. How can I test whether a regular expression is a valid Oniguruma regular expression?
Mihai
  • You ONLY have a positive lookahead assertion. This is a zero-length position, so nothing is matched: the match length is 0. Read more about regex. – rioV8 Apr 16 '22 at 17:04
  • Thank you for your very insightful comment. Then how do you explain that the capturing works [here](https://regex101.com/r/7svfsq/2)? – Mihai Apr 16 '22 at 20:16
  • If you use `captures`, you can use `and\s(Oniguruma)\s+regular`; there is no need for a lookahead or lookbehind. – rioV8 Apr 16 '22 at 20:39
  • My question is not about needing a `regex` expression. I am interested to know why the capture I can get in the first place is not working in VS Code and to what extent it is or is not a valid `Oniguruma` expression. – Mihai Apr 16 '22 at 20:56
  • I would say it is a bug in regex101: you don't have capture groups in a lookahead/lookbehind, because it is not part of the matched text. If you write it according to the rules of the `Oniguruma` docs, it is an `Oniguruma` regex. – rioV8 Apr 16 '22 at 21:13
  • Of course it is not a bug at regex101. It is a peculiarity of the editor feature you are using for highlighting. – Wiktor Stribiżew Apr 16 '22 at 21:14
  • @WiktorStribiżew, if I understand you correctly, it's either (1) capturing inside a positive lookahead is not a valid expression for the `Oniguruma` engine, or (2) VS Code's (i.e., via TextMate grammars) version of the `Oniguruma` engine does not support this syntax, hence the peculiarity. I believe (2) is more likely, and I also found a [mention on Wikipedia](https://en.wikipedia.org/wiki/Oniguruma) about an updated version, i.e., `Onigmo`, that introduces more features. – Mihai Apr 16 '22 at 21:31
  • @rioV8, [here is another counterexample](https://stackoverflow.com/a/71650015/5252007) that using capture groups in lookaheads is acceptable. I too don't agree that there is a bug in `regex101`. Instead, I think VS Code uses an older version of the engine that does not support this syntax. – Mihai Apr 16 '22 at 21:49
  • It is not the problem of the regex library but the peculiarity of the software that uses the regex library. Capturing inside positive lookaheads works fine in any regex flavor that supports lookarounds. – Wiktor Stribiżew Apr 16 '22 at 22:00
  • I see, and sadly this makes it harder to build an expression on `regex101` because I do not know how it will behave in VS Code. For example, even this one fails `\|\s+(\d[m])\s+\|` to capture `0m`, whereas, for whatever strange reason, this one succeeds `(?<=\|)(?:\s+)(\d[m])(?=\s+\|)`. Thank you for shedding some light on this issue. – Mihai Apr 16 '22 at 22:16
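
The point made in the comments (a lookahead-only pattern produces a zero-length match, yet its capture group still captures) can be checked with a small sketch. It uses Python's `re` module as a stand-in, which is an assumption: `re` is not Oniguruma, and the sketch does not run VS Code's tokenizer. For these two patterns the lookaround behavior should be the same, though. The helper name `spans` is made up for illustration.

```python
import re

TEXT = "VS Code, TextMate grammars, and Oniguruma regular expressions."


def spans(pattern: str, text: str = TEXT):
    """Return (whole-match span, group-1 span) for the first match."""
    m = re.search(pattern, text)
    if m is None:
        raise ValueError(f"no match for {pattern!r}")
    return m.span(0), m.span(1)


# Lookahead only: the overall match is empty and group 1 starts after it.
lookahead_only = r"(?=and\s+(Oniguruma)\s+regular)"
# Lookbehind + lookahead around the word: the match is the word itself.
lookaround = r"(?<=and\s)(Oniguruma)(?=\s+regular)"

for name, pattern in [("lookahead only", lookahead_only),
                      ("lookaround", lookaround)]:
    whole, group1 = spans(pattern)
    inside = whole[0] <= group1[0] and group1[1] <= whole[1]
    print(f"{name}: match={whole}, group 1={group1}, group inside match: {inside}")
```

In the first pattern the capture lies entirely outside the (empty) match, while in the second the capture and the match coincide. That difference is consistent with the behavior reported in the question, where only the second pattern is highlighted, but the sketch does not show what the tokenizer actually does with out-of-match captures.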

2 Answers

2

VS Code uses TextMate grammars for tokenization, and TextMate grammars use the Oniguruma engine for regular expressions.

Ruby 1.9 uses the Oniguruma engine (Ruby 2.0 and later use Onigmo, a fork of it), and Rubular runs Ruby 2.5.9.

I've been using Rubular to validate my VS Code TextMate grammars for a while, and it has never failed once.

ghaschel
-2

I'm pretty sure your regex is being overridden by another one that is also present in the .tmLanguage.json file.
To check this, do the following:
In the file that contains your regex (assuming it also contains other patterns), search (Ctrl + F) for the TextMate scope "text.html.markdown" (as shown in your screenshot). Replace the regex registered for that scope with your own, change its name to "some.test", and then reload VS Code.