4

Is it possible to return all the repeating and matching subgroups from a single call with a regular expression?

For example, I have a string like:

{{token id=foo1 class=foo2 attr1=foo3}}

where the number of attributes (e.g. id, class, attr1) is not fixed and each one can be any key=value pair.

At the moment, I have the following regexp and output:

var pattern = /\{{([\w\.]+)(?:\s+(\w+)=(?:("(?:[^"]*)")|([\w\.]+)))*\}\}/;
var str = '{{token arg=1 id=2 class=3}}';

var matches = str.match(pattern);
// -> ["{{token arg=1 id=2 class=3}}", "token", "class", undefined, "3"]

It seems that it only captures the last group. Is there any way to get all the other "attributes" (arg and id)?

Note: the example illustrates a match on a single string, but the pattern may be located in a much larger string, possibly containing many matches. So ^ and $ cannot be used.

Yanick Rochon
  • so you want both the key and value in a single regex? – Amit Joki Mar 19 '14 at 15:51
  • provide some test cases and what the output should be – Amit Joki Mar 19 '14 at 15:53
  • yes, I want all of them returned. I thought I could have sub-groups. Perhaps a sub-array of matches, or something. – Yanick Rochon Mar 19 '14 at 15:54
  • @AmitJoki, There's already a test case, whatever the expected output, I want the regexp to match and return (perhaps in sub-groups) **all** the "attributes", and not only the last one. – Yanick Rochon Mar 19 '14 at 15:55
  • Not relevant to your problem, but I think you have one more grouping than you need. Seems like this `"(?:[^"]*)"` could just be this `"[^"]*"` – cookie monster Mar 19 '14 at 16:07
  • So you want the result to be something like this, right? `["{{token arg=1 id=2 class=3}}", "token", "arg", undefined, "1", "id", undefined, "2", "class", undefined, "3"]` – cookie monster Mar 19 '14 at 16:11
  • Well anyway, I'm not a regex pro, but I'm pretty sure you're going to be doing this in two passes. One to get each `{{...}}` group *(which can have subgroups of the 'token' and all the 'attributes')*, then another pass to break down the attributes for each group. – cookie monster Mar 19 '14 at 16:21

3 Answers

2

This is not possible with a single regular expression in JavaScript. When a capture group is repeated, the JavaScript regex engine keeps only the last iteration's match, which is exactly the problem you are seeing. I had the same issue a while back: Regex only capturing last instance of capture group in match. You can get this to work in .NET, but that's probably not what you need.
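
To make the "last iteration wins" behaviour concrete, here is a minimal sketch. The pattern is a simplified version of the one in the question (unquoted `\w+` values only), and the names `pattern` and `m` are just for illustration:

```javascript
// A capture group inside a repeated (?: ... )* keeps only its final iteration.
const pattern = /\{\{(\w+)(?:\s+(\w+)=(\w+))*\}\}/;
const m = '{{token arg=1 id=2 class=3}}'.match(pattern);

console.log(m[1]); // "token"
console.log(m[2]); // "class"  (arg and id were overwritten)
console.log(m[3]); // "3"
console.log(m.length); // 4: the array never grows with the number of repetitions
```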

What you can do is capture the whole attribute list with one regular expression and then split the arguments out of the second group:

\{\{(\w+)\s+(.*?)\}\}

Here's some JavaScript code (using jQuery for input and output) to show how it's done:

var input = $('#input').text();
var regex = /\{\{(\w+)\s*(.*?)\}\}/g;
var match;
var attribs;
var kvp;
var output = '';

while ((match = regex.exec(input)) !== null) {
    output += match[1] + ': <br/>';

    // match[2] is the raw attribute list; it is empty when the token has no attributes
    if (match[2]) {
        attribs = match[2].split(/\s+/);
        for (var i = 0; i < attribs.length; i++) {
            kvp = attribs[i].split(/\s*=\s*/);
            output += ' - ' + kvp[0] + ' = ' + kvp[1] + '<br/>';
        }
    }
}
$('#output').html(output);

jsFiddle
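
The same two-pass idea can be sketched without jQuery and with support for the quoted values from the question. This is only a sketch under stated assumptions: `parseTokens`, `tokenRe` and `attrRe` are hypothetical names, quoted values cannot contain a `"`, and `String.prototype.matchAll` (ES2020) is available.

```javascript
// Pass 1 finds each {{name attr=value ...}} block and captures the raw attribute list.
// Pass 2 runs a second global regex over that list to pull out every key/value pair.
function parseTokens(input) {
  const tokenRe = /\{\{([\w.]+)((?:\s+\w+=(?:"[^"]*"|[\w.]+))*)\s*\}\}/g;
  const attrRe = /(\w+)=(?:"([^"]*)"|([\w.]+))/g;
  const tokens = [];

  for (const tokenMatch of input.matchAll(tokenRe)) {
    const attrs = {};
    for (const [, key, quoted, bare] of tokenMatch[2].matchAll(attrRe)) {
      // Exactly one of the two value groups participates in each match.
      attrs[key] = quoted !== undefined ? quoted : bare;
    }
    tokens.push({ name: tokenMatch[1], attrs });
  }
  return tokens;
}

console.log(parseTokens('{{token arg=1 id=2 class=3}}'));
// [ { name: 'token', attrs: { arg: '1', id: '2', class: '3' } } ]
```

Because neither regex is anchored with ^ or $, this also works when the tokens are scattered through a larger string.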

A crazy idea would be to use a regex and replace to convert your input into JSON and then decode it with JSON.parse. The following is only a start on that idea:

input.replace(/[\s\S]*?(?:\{\{(\w+)\s+(.*?)\}\}|$)/g, doReplace);

// Callback arguments: $1 is the whole match, $2 the token name, $3 the attribute list.
function doReplace ($1, $2, $3) {
  if ($2) {
    return "'" + $2 + "': {" +
      $3.replace(/\s+/g, ',')
        .replace(/=/g, ':')
        .replace(/(\w+)(?=:)/g, "'$1'") + '};\n';
  }
  return '';
}
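
One way the idea might be carried through to a string that JSON.parse accepts is sketched below. It is not the code from the answer: `tokensToJson` is a hypothetical helper, and it assumes that keys and values are plain `\w+` runs with no quotes or spaces, since anything else would need proper escaping.

```javascript
// Build a JSON array string out of the tokens, then let JSON.parse do the decoding.
function tokensToJson(input) {
  const entries = [];
  // replace() is used only to visit every match; the input itself is left unchanged.
  input.replace(/\{\{(\w+)\s*(.*?)\}\}/g, (whole, name, attrList) => {
    const body = attrList.trim()
      .replace(/(\w+)=(\w+)/g, '"$1":"$2"') // key=value  ->  "key":"value"
      .replace(/\s+/g, ',');                // whitespace between pairs -> comma
    entries.push('{"name":"' + name + '","attrs":{' + body + '}}');
    return whole;
  });
  return '[' + entries.join(',') + ']';
}

const parsed = JSON.parse(tokensToJson('{{token arg=1 id=2}}'));
console.log(parsed[0].attrs.id); // "2"
```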

Daniel Gimenez
  • Yes, I remember doing this in .Net at some point, maybe why I got confused about subgroups and all (.Net has corrupted my JavaScript!). I'll just match the entire `key=value` attribute list, then, and parse them separately in that case. – Yanick Rochon Mar 19 '14 at 17:43
  • Yes, this is a clever idea (the second one). With a few modifications, it is possible to generate a JSON string that can be parsed. :) Note, however, that I've decided to drop regexps in favor of a state machine parser, which gives more flexibility and better error management. But you totally suggested an innovative approach to this problem. – Yanick Rochon Mar 20 '14 at 17:12
0

You could do this:

var s = "{{token id=foo1 class=foo2 attr1=foo3 hi=we}} hiwe=wef";
var matches = s.match(/(\w+(?==\w+)|(?!==\w+)\w+)(?!\{\{)(?!.*token)(?=.*}})/g);
matches.splice(0,1);
for (var i = 0; i < matches.length; i++) {
    alert(matches[i]);
}

The regex is /(\w+(?==\w+)|(?!==\w+)\w+)(?!\{\{)(?!.*token)(?=.*}})/g (Use global modifier g to match all attributes)

The array will look like this:

["id","foo1","class","foo2","attr1","foo3","hi","we"]

Live demo: http://jsfiddle.net/HYW72/1/

Amit Joki
  • That's far too loose. It'll include matches outside the `{{}}` and makes no check for `token`. – cookie monster Mar 19 '14 at 16:01
  • Still matches stuff outside the `}}` http://jsfiddle.net/cmjwy/3/ And you're not capturing the values that OP wants. – cookie monster Mar 19 '14 at 16:09
  • @cookiemonster, see OP's comment http://stackoverflow.com/questions/22511031/javascript-regexp-repeating-subgroup/22511524#comment-34251012 – Amit Joki Mar 19 '14 at 16:10
  • Yes, I saw that comment. You're only capturing the attribute name, not the value as well, nor the starting token *(which presumably isn't literally the word "token")* as shown in the question. – cookie monster Mar 19 '14 at 16:13
  • I really don't think you're understanding what he wants. This is presumably for some sort of templating system, and he needs every grouping of `{{word attr=val attr2=val2}}`, where there could be multiple groupings in a large string, and the values of the attributes may or may not be quoted. http://jsfiddle.net/cmjwy/5/ – cookie monster Mar 19 '14 at 16:19
  • Sorry, but no. It picks up matches outside the `{{}}` boundaries. http://jsfiddle.net/HYW72/2/ – cookie monster Mar 19 '14 at 16:29
  • @AmitJoki, so, basically you're suggesting a two pass approach? – Yanick Rochon Mar 19 '14 at 16:40
  • @YanickRochon, I'm coming up with a single regex. – Amit Joki Mar 19 '14 at 16:42
0
str = "{{token id=foo1 class=foo2 attr1=foo3}}"
if lMatches = str.match(///^
        \{\{
        ([a-z][a-z0-9]*)   # identifier
        (
            (?:
                \s+
                ([a-z][a-z0-9]*)  # identifier
                =
                (\S*)             # value
                )*
            )
        \}\}
        $///)
    [_, token, attrStr] = lMatches

    hAttr = {}
    for lMatches from attrStr.matchAll(///
            ([a-z][a-z0-9]*)  # identifier
            =
            (\S*)             # value
            ///g)
        [_, key, value] = lMatches
        hAttr[key] = value

    console.log "token = '#{token}'"
    console.log hAttr
else
    console.log "NO MATCH"

This is CoffeeScript, because it's so much easier to read. I hate it when .NET gets something right that JavaScript fails on, but here you have to match the entire string of attribute/value pairs in one regexp and then parse that string to get what you want (matchAll(), which returns an iterator, is handy here). A /// regexp runs until the next /// and ignores whitespace, which also allows comments. The code makes a lot of assumptions: keys are identifiers made of lower-case letters and digits, values are any run of non-whitespace characters (including the empty string), attribute names are unique, and so on. They are all easy to change.

FYI, the above code outputs:

token = 'token'
{ id: 'foo1', class: 'foo2', attr1: 'foo3' }
John Deighan