How to parse a regular expression?

Question

Disclaimer before this is auto-closed. This is NOT the same as this:

How do you access the matched groups in a JavaScript regular expression?

Let's say I have this regular expression:

const regex = /(\w+) count: (\d+)/

Is there a way I can extract the capture groups so that I have:

[ '\w+', '\d+' ]

Out of curiosity, why do you need this? `/(\w+) count: (\d+)/.toString().match(/(?<=\().+?(?=\))/g)` works on this trivial example but regex itself isn't a regular language so it seems like the job of a parser. — ggorlen, Feb 09 '21 at 21:50
Regular expressions all the way down. `/(\w+) count: (\d+)/.toString()` — Slava Knyazev, Feb 09 '21 at 21:50
@ggorlen it's for a complicated UI that allows users to supply a regex and describe how to alter the text in a match. We want to display the specific capture group related to the text they will be altering. Kind of hard to explain, but they are spec requirements — Troncoso, Feb 09 '21 at 21:53
Thanks. So you're basically building a regex engine? I don't think this is really a good task for regex. — ggorlen, Feb 09 '21 at 21:55
No, not quite. What we are doing with the provided regex is fairly simple. But the spec just threw in showing these individual capture groups with no thought to the complexity of the task. — Troncoso, Feb 09 '21 at 21:58
Yeah but you sort of need to build a regex parser to do what you want to do. You can have arbitrarily nested and escaped parens, so the complexity of a regex prohibits parsing it with itself very easily. See [this](https://stackoverflow.com/questions/172303/is-there-a-regular-expression-to-detect-a-valid-regular-expression) for starters. — ggorlen, Feb 09 '21 at 22:00

Peter Thoeny · Answer 1 · 2021-02-10T03:26:24.550

As others pointed out, for the general case you'd need a real parser, such as one built with Lex & Yacc. You can, however, use regex and some recursion magic to parse nested structures. See details at https://twiki.org/cgi-bin/view/Blog/BlogEntry201109x3

Here is a JavaScript version that can parse nested groups properly. The default test is `(\w+) count: (\d+), number: (-?\d+(\/\d+)?)`, i.e. three groups at level 0, and one group nested at level 1 inside the third group:

    // configuration:
    const ctrlChar = '~'; // use non-printable, such as '\x01'
    const cleanRegex = new RegExp(ctrlChar + '\\d+' + ctrlChar, 'g');

    function parseRegex(str) {

        function _levelRegx(level) {
            return new RegExp('(' + ctrlChar + level + ctrlChar + ')\\((.*?)(' + ctrlChar + level + ctrlChar + ')\\)', 'g');
        }

        function _extractGroup(m, p1, p2, p3) {
            //console.log('m: ' + m + ', p1: ' + p1 + ', p2: ' + p2 + ', p3: ' + p3);
            groups.push(p2.replace(cleanRegex, ''));
            let nextLevel = parseInt(p1.replace(/\D/g, ''), 10) + 1;
            p2 = p2.replace(_levelRegx(nextLevel), _extractGroup);
            return '(' + p2 + ')';
        }

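        // annotate parentheses with proper nesting level;
        // note: the lookbehind only skips a paren directly preceded by a backslash,
        // so parens inside character classes such as [(] are still counted: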
        let level = 0;
        str = str.replace(/(?<!\\)[\(\)]/g, function(m) {
            if(m === '(') {
                return ctrlChar + (level++) + ctrlChar + m;
            } else {
                return ctrlChar + (--level) + ctrlChar + m;
            }
        });
        console.log('nesting: ' + str);

        // recursively extract groups:
        let groups = [];
        level = 0;
        str = str.replace(_levelRegx(level), _extractGroup);
        console.log('result: ' + str);
        console.log('groups: [ \'' + groups.join('\', \'') + '\' ]');
        $('#regexGroups').text(JSON.stringify(groups, null, ' '));
    }

    $(document).ready(function() {
        let str = $('#regexInput').val();
        parseRegex(str);

        $('#regexInput').on('input', function() {
            let str = $(this).val();
            parseRegex(str);
        });
    });

div, input {
  font-family: monospace;
}

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.0/jquery.min.js"></script>
<div>
<p>Regex: <input id="regexInput" value="(\w+) count: (\d+), number: (-?\d+(\/\d+)?)" size="60" /></p>
<p>Groups: <span id="regexGroups"></span></p>
<p>.<br />.<br />.</p>
</div>

You can try it out with various nested patterns.

Explanation:

step 1: annotate opening and closing parentheses with their nesting level:
- the annotation uses the control character `~`
- in real-life use, pick a non-printable character such as `'\x01'` to avoid collisions with the input
- the result for `(\w+)` is `~0~(\w+~0~)`
- the result for the default input is `~0~(\w+~0~) count: ~0~(\d+~0~), number: ~0~(-?\d+~1~(\/\d+~1~)?~0~)`
step 2: recursively extract groups:
- start at level 0 and extract all groups at that level
- for each matched group, recursively extract all groups at the next level
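
For comparison, the same result can probably be reached in a single pass without marker characters, by scanning the pattern and keeping a stack of open parentheses. This is a swapped-in technique, not the approach used in the snippet above. A minimal sketch, assuming a hypothetical helper name `extractGroups` and JavaScript regex syntax: it skips escaped characters and ignores parentheses inside character classes, but it does not distinguish capturing groups from `(?:...)` or lookarounds, so those bodies are returned too.

```javascript
// Hypothetical helper (not part of the answer's code): returns the body of
// every parenthesized group in a regex source string, ordered by opening paren.
function extractGroups(source) {
    const groups = [];   // group bodies, indexed by order of their opening paren
    const stack = [];    // currently open groups: { slot, start }
    let inClass = false; // true while inside a [...] character class

    for (let i = 0; i < source.length; i++) {
        const ch = source[i];
        if (ch === '\\') {
            i++;                 // skip the escaped character, e.g. \( or \\
        } else if (inClass) {
            if (ch === ']') inClass = false;
        } else if (ch === '[') {
            inClass = true;
        } else if (ch === '(') {
            stack.push({ slot: groups.length, start: i + 1 });
            groups.push(null);   // reserve the slot so order follows opening parens
        } else if (ch === ')') {
            const open = stack.pop();
            if (!open) throw new SyntaxError('Unmatched ) at offset ' + i);
            groups[open.slot] = source.slice(open.start, i);
        }
    }
    if (stack.length > 0) throw new SyntaxError('Unclosed ( in pattern');
    return groups;
}

// Usage: pass the pattern text, e.g. from RegExp.prototype.source.
console.log(extractGroups(/(\w+) count: (\d+)/.source));
```

For the question's regex this yields the two strings `\w+` and `\d+`. Because the stack reserves a slot when a group opens, nested groups appear in the same order as JavaScript numbers them in a match result.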

@Troncoso: Any questions? Does this fit your needs? — Peter Thoeny, Feb 10 '21 at 23:48
