How to define regexp variables in TM language?

Question

In a sublime-syntax file you can define variables for use in regular expressions (e.g. - match: "{{SOME_VARIABLE}}"). It looks like you can't do this in tmLanguage (https://macromates.com). Since highlighters frequently expand variables, is there a utility that adds this kind of variable support to the TM language descriptor, so it can be used with VSCode? I found nothing with a search engine.

score 3 · Answer 1 · answered Jul 10 '20 at 19:27

I too was looking for this functionality, as the regular expressions get long and complex very quickly, especially when writing the tmLanguage file in JSON, which forces you to double every backslash (\\).
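
To illustrate the escaping overhead, here is a quick Python check (the pattern is just an example picked for this illustration): a regex that needs single backslashes in a plain or single-quoted YAML scalar has every backslash doubled once it is stored as a JSON string.

```python
import json

# A regex as it could be written in a plain YAML scalar (single backslashes).
pattern = r"\b(enum)\s+"

# Serialised to JSON, every backslash is doubled.
print(json.dumps({"match": pattern}))
# → {"match": "\\b(enum)\\s+"}
```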

It does not seem to be supported out of the box by TextMate. However, you can get variable support if you don't mind some pre-processing.

I found this kind of solution while browsing Microsoft's TypeScript TmLanguage GitHub repository. They define the TypeScript grammar in YAML, which is more readable and needs only a single backslash to escape characters. In this YAML file, they define "variables" for frequently used patterns, e.g.:

variables:
  startOfIdentifier: (?<![_$[:alnum:]])(?:(?<=\.\.\.)|(?<!\.))
  endOfIdentifier: (?![_$[:alnum:]])(?:(?=\.\.\.)|(?!\.))
  propertyAccess: (?:(\.)|(\?\.(?!\s*[[:digit:]])))
  propertyAccessPreIdentifier: \??\.\s*
  identifier: '[_$[:alpha:]][_$[:alnum:]]*'
  constantIdentifier: '[[:upper:]][_$[:digit:][:upper:]]*'
  propertyIdentifier: '\#?{{identifier}}'
  constantPropertyIdentifier: '\#?{{constantIdentifier}}'
  label: ({{identifier}})\s*(:)

Then they reuse those "variables" in the pattern definitions, or even in other variables (above, the label variable uses the identifier variable), e.g.:

enum-declaration:
    name: meta.enum.declaration.ts
    begin: '{{startOfDeclaration}}(?:\b(const)\s+)?\b(enum)\s+({{identifier}})'
    beginCaptures:
      '1': { name: keyword.control.export.ts }
      '2': { name: storage.modifier.ts}
      '3': { name: storage.modifier.ts}
      '4': { name: storage.type.enum.ts }
      '5': { name: entity.name.type.enum.ts }
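
To make the nesting concrete, here is a small Python sketch (an illustration only, not code from the TypeScript repository) that expands variables in definition order, so that label picks up the already defined identifier. The helper name expand_variables is made up for this example, and leaving unknown names untouched is an assumption about the desired behaviour.

```python
import re

def expand_variables(variables: dict) -> dict:
    """Expand {{name}} references using only variables defined earlier."""
    resolved = {}
    for name, pattern in variables.items():
        # Unknown names are left untouched rather than raising an error.
        resolved[name] = re.sub(
            r"\{\{(\w+)\}\}",
            lambda m: resolved.get(m.group(1), m.group(0)),
            pattern,
        )
    return resolved

variables = {
    "identifier": r"[_$[:alpha:]][_$[:alnum:]]*",
    "label": r"({{identifier}})\s*(:)",
}
print(expand_variables(variables)["label"])
# → ([_$[:alpha:]][_$[:alnum:]]*)\s*(:)
```

Because each variable only sees the ones defined before it, the order of the entries in the variables section matters.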

Finally, they use a build script to transform this YAML grammar into a plist or JSON grammar. The build script removes the "variables" property from the grammar, since it is not part of the tmLanguage spec, and loops over the variable definitions to replace their occurrences ({{variable}}) in other variables and in the begin, end, and match patterns.

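// Applies every known replacer to a pattern, in definition order.
function replacePatternVariables(pattern: string, variableReplacers: VariableReplacer[]) {
    let result = pattern;
    for (const [variableName, value] of variableReplacers) {
        result = result.replace(variableName, value);
    }
    return result;
}

// Pair of [regex matching {{name}}, expanded pattern for that name].
type VariableReplacer = [RegExp, string];
// TmGrammar, MapLike and transformGrammarRepository are not shown in this excerpt.
function updateGrammarVariables(grammar: TmGrammar, variables: MapLike<string>) {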
    delete grammar.variables;
    const variableReplacers: VariableReplacer[] = [];
    for (const variableName in variables) {
        // Replace the pattern with earlier variables
        const pattern = replacePatternVariables(variables[variableName], variableReplacers);
        variableReplacers.push([new RegExp(`{{${variableName}}}`, "gim"), pattern]);
    }
    transformGrammarRepository(
        grammar,
        ["begin", "end", "match"],
        pattern => replacePatternVariables(pattern, variableReplacers)
    );
    return grammar;
}

Not exactly what you (and I) were looking for, but if your grammar is big enough, it helps. For a small grammar, I would not bother with this pre-processing.

Hydroper · Answer 2 · 2023-06-14T13:52:41.857

I made a command-line tool that converts a YAML version of the TMLanguage syntax, with support for these variables, to JSON: https://www.npmjs.com/package/com.matheusds365.vscode.yamlsyntax2json (GitHub repo)

For more information on the TMLanguage format and creating language extensions for Visual Studio Code, look at this StackOverflow answer.

You can refer to variables using {{variableName}} syntax.

Install it with NPM:

npm i -g com.hydroper.tmlanguage.yamlsyntax2json

Here is an example:

# tmLanguage
---
$schema: https://raw.githubusercontent.com/martinring/tmlanguage/master/tmlanguage.json
name: MyLanguageName
scopeName: source.mylang

variables:
  someVar: 'xxx'

patterns:
  - include: '#foo'

repository:
  foo:
    patterns: []

Run:

yamlsyntax2json mylanguage.tmLanguage.yaml mylanguage.tmLanguage.json

Output:

{
    "$schema": "https://raw.githubusercontent.com/martinring/tmlanguage/master/tmlanguage.json",
    "name": "MyLanguageName",
    "patterns": [
        {
            "include": "#foo"
        }
    ],
    "repository": {
        "foo": {
            "patterns": []
        }
    },
    "scopeName": "source.mylang"
}

Nice try, but I am afraid there is at least one bug. "captures" are ignored since they are not instances of Array: https://github.com/hydroper/tmlanguage-yaml2json/blob/eea10b9365f6d4c25d2ac1ea54976ae6805deda1/bin/yamlsyntax2json#L63 — Aristide, Jun 14 '23 at 12:20
@Aristide Makes sense since captures are maps... Fixed and published — Hydroper, Jun 14 '23 at 13:54
Thanks. I ended up writing my own version in Python. The main work is done by the standard library yaml. — Aristide, Jun 14 '23 at 14:56

score 1 · Answer 3 · answered Jun 14 '23 at 15:00

Inspired by the other answers, I have written this Python function:

import json
import re
from pathlib import Path

import yaml  # PyYAML, a third-party package (pip install pyyaml)

def convert(source: Path, target: Path):

    def finalize_regexes(d):
        if isinstance(d, dict):
            for k, v in d.items():
                if k in ("match", "begin", "end"):
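                    # Substitute each {{name}} with its value; values that themselves
                    # contain {{other}} are not expanded any further.
                    v = re.sub(r"\{\{(\w+)\}\}", lambda m: variables[m[1]], v)
                    d[k] = v.replace(" ", "")  # Warning: strips every space from the regex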
                else:
                    finalize_regexes(v)
        elif isinstance(d, list):
            for i in d:
                finalize_regexes(i)

    def compress_single_key_dictionaries(json_text):
        return re.sub(r"(?m)^((?:\s*|.+: )\{)\n\s*(.+)\n\s*(\})", r"\1 \2 \3", json_text)

    data = yaml.safe_load(source.read_text())
    variables = data.pop("variables", {})
    finalize_regexes(data)
    json_text = json.dumps(data, indent=2)
    json_text = compress_single_key_dictionaries(json_text)
    target.write_text(json_text)
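
The regex in compress_single_key_dictionaries is a little opaque, so here is a small self-contained demonstration of its effect (the sample data is made up for this illustration). An object that holds a single one-line entry is folded onto one line, which keeps capture maps compact in the generated JSON.

```python
import json
import re

def compress_single_key_dictionaries(json_text):
    # Same regex as in the function above.
    return re.sub(r"(?m)^((?:\s*|.+: )\{)\n\s*(.+)\n\s*(\})", r"\1 \2 \3", json_text)

data = {"captures": {"1": {"name": "keyword.control"}}}
print(compress_single_key_dictionaries(json.dumps(data, indent=2)))
# → {
#     "captures": {
#       "1": { "name": "keyword.control" }
#     }
#   }
```

The substitution runs in a single pass, so only the innermost object is folded; the enclosing "captures" object keeps its multi-line layout.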
