
Given a string of HTML-like attributes, e.g. 'attr="val" attr2="val2"', I'd like to extract the attribute names and values. It gets complicated because a value can contain spaces (so splitting on spaces won't work), and it can contain both ' and " (note that the value itself can be delimited by either ' or "). Finally, quotes may be preceded by a backslash, i.e. \' or \". I've managed to capture almost everything except that last case: a value containing \" or \'.

The regex I have so far is here: https://regex101.com/r/Z7q73R/1
My goal is to turn the string 'attr="val" attr2="val\"2a\" val2b"' into the object {attr: 'val', attr2: 'val"2a" val2b'}.

  • [Dangerous title](https://stackoverflow.com/a/1732454/2707792), but OK... Do you really want to do it by regex? Wouldn't it be easier to prepend / append ``, build a dom element from it, then enumerate the attributes, put them in a JS dictionary, and then convert to JSON, or something along those lines? – Andrey Tyukin May 20 '18 at 20:54
  • I would wrap all names in quotes by replacing `\w+=` with `"$0"=`, add `{` and `}` to the ends then parse as JSON. – Bohemian May 20 '18 at 20:56
  • @AndreyTyukin currently I've used the solution you mention, but if it could be achieved with regex, I'd prefer regex. Thanks anyway. – Damian Czapiewski May 20 '18 at 20:59
  • Try `(\w+)="((?:[^\\"]*(?:\\.[^\\"]*)*))"`. See demo here https://regex101.com/r/pOBj91/1 – revo May 20 '18 at 21:00
  • @DamianCzapiewski Honestly, I'm not so sure whether it's really a good idea. While this wheel is relatively easy to reinvent, it's also difficult to get right. Matching string literals is nasty. If you can use a robust existing framework that already does exactly that, then use it. – Andrey Tyukin May 20 '18 at 21:02
  • @revo that's it! Thank you!! You should post it as an answer so I can upvote it. I guess one enhancement remains, though: support for single quotes. – Damian Czapiewski May 20 '18 at 21:02
  • [*Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.* - Jamie Zawinski](http://regex.info/blog/2006-09-15/247) – May 20 '18 at 22:16

3 Answers

1

If we assume that all attribute values are enclosed in double quotes, that names consist of word characters ([a-zA-Z0-9_]), and that attributes are separated by at least one space character, then the regex below matches as expected:

(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"

Breaking down the [^\\"]*(?:\\.[^\\"]*)* chunk:

  • [^\\"]* Match anything except a backslash or "
  • (?: Start of non-capturing group
    • \\. Match an escaped character (a backslash followed by any character)
    • [^\\"]* Match anything except a backslash or "
  • )* End of non-capturing group, repeated as many times as possible
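
The pieces above can be put together into a small sketch that builds the object asked for in the question. The helper name `parseAttrs` and the unescape step (dropping the backslash in front of any escaped character) are assumptions for illustration, not part of this answer's code. It also assumes double-quoted values only and an environment that supports `String.prototype.matchAll`.

```javascript
// Hypothetical helper: parse double-quoted attributes into a plain object.
function parseAttrs(input) {
    // Same pattern as above: a name, then a quoted value that may hold \" escapes.
    const re = /(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"/g;
    const result = {};
    for (const m of input.matchAll(re)) {
        // Assumption: unescaping means dropping the backslash before any character.
        result[m[1]] = m[2].replace(/\\(.)/g, "$1");
    }
    return result;
}

// String.raw keeps the backslashes exactly as typed.
const attrs = parseAttrs(String.raw`attr="val" attr2="val\"2a\" val2b"`);
console.log(attrs);
```

Whether the backslashes should be stripped at all depends on what the values are used for afterwards, so the `replace` call may need adjusting.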

JS code:

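var str = `'attr="val" attr2="val2"'`;
var re = /(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"/g;
var m; // declared explicitly to avoid an implicit global

while ((m = re.exec(str)) !== null) {
    // Guard against infinite loops on zero-width matches
    if (m.index === re.lastIndex)
        re.lastIndex++;
    console.log(m[1] + " => " + m[2]);
}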
revo
  • You're welcome. If you need to include values enclosed within single-quotes do this `(\w+)=(?:"([^\\"]*(?:\\.[^\\"]*)*)"|'([^\\']*(?:\\.[^\\']*)*)')` – revo May 20 '18 at 21:21
  • Yep, I did it yet another way: capture either quote and use a backreference to that group at the end, as in the code I've posted. – Damian Czapiewski May 20 '18 at 21:25
0

Thanks to @revo, I've got working code. I'm posting it below for posterity.

const regex = /(\w+)=(?:"([^\\"]*(?:\\.[^\\"]*)*)"|'([^\\']*(?:\\.[^\\']*)*)')/gm;
const str = `attr1="\\'val\\'\\"1\\"" attr2='val2a \\'hello\\' \\"yo\\" val2b'`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // Compare against undefined so an empty double-quoted value ("") is not skipped
    console.log(m[1] + ' => ' + ( m[2] !== undefined ? m[2] : m[3] ));
}
revo
  • This would cause some problems. First, it will capture a leading quote, as in `'attr="val"...`, in the capturing group that is supposed to hold the attribute name. Second, it will match the whole string if the input is something like `attr1='val\"1\"' attr2='val2a val2b' attr3='val2a val2b'`. See it in action: https://regex101.com/r/zpcs0z/1 – revo May 20 '18 at 21:26
  • I tried the regex you mentioned in a comment on your answer, and it also doesn't seem to capture properly, see: https://regex101.com/r/vLf5Qy/1 . It's mind-blowing how hard it is to construct a correct regex :) – Damian Czapiewski May 20 '18 at 21:30
  • Well, your string is corrupted. It should be `attr1="\'val\'\"1\"" attr2='val2a \'hello\' \"yo\" val2b'`. You use one backslash to escape a character. Two backslashes mean a literal backslash. – revo May 20 '18 at 21:31
  • Yep, you're right, I must already be exhausted with those regexps today :) Once again, THANKS! You seem to be a genius as you solve this in a couple of minutes. – Damian Czapiewski May 20 '18 at 21:34
  • My pleasure. I live in the regex tag. Once again, stick with the one from the comments, otherwise you should expect unwanted matches. – revo May 20 '18 at 21:42
  • @revo ok, I used the recommended one and the input is no longer corrupted, but the results are still not correct. Run the snippet to see that :/ – Damian Czapiewski May 20 '18 at 21:56
  • Yet here: https://regex101.com/r/FeEnnL/2 it runs ok. Anyway, I've got enough to go on :) – Damian Czapiewski May 20 '18 at 22:01
  • It is because of the way string interpretation works. See edit I made. – revo May 20 '18 at 22:13
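
The point about string interpretation can be seen directly in JavaScript. In an ordinary string literal, `\"` is consumed by the parser and yields just `"`, so the regex never sees a backslash; writing `\\"` or using `String.raw` keeps it. A small sketch of the difference (the variable names are illustrative only):

```javascript
// In a normal literal the backslash acts as an escape and disappears.
const cooked = 'a="x\"y"';
// Doubling it, or using String.raw, keeps a real backslash in the string.
const doubled = 'a="x\\"y"';
const raw = String.raw`a="x\"y"`;

const re = /(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"/;

console.log(cooked.length, doubled.length); // the lengths differ by the one backslash
console.log(re.exec(cooked)[2]);            // value ends at the first, unescaped quote
console.log(re.exec(doubled)[2]);           // value runs past the escaped quote
console.log(doubled === raw);               // both spellings produce the same string
```

On regex101 the test string is taken literally, which is why a pattern can behave differently there than in a snippet that embeds the same text in a string literal.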
0

You could also do it like this.

Readable regex

 ( \w+ )                       # (1), Attribute
 \s* 
 =                             # =
 \s* 
 ( ["'] )                      # (2), Value quote ', or "

 (                             # (3 start), Value
      [^"'\\]*                      # 0 to many not ",', or \ chars
      (?:                           # --------
           (?:                           # One of ...
                \\ [\S\s]                     # Escape + anything
             |                              # or,
                (?! \2 | \\ )                 # Not the value quote, nor escape
                [\S\s] 
           )                             # -----------
           [^"'\\]*                      # 0 to many not ",', or \ chars
      )*                            # Do 0 to many times
 )                             # (3 end)

 \2                            #  Value quote ', or "
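
JavaScript regex literals have no free-spacing mode, so the commented layout above cannot be pasted in as-is. One way to keep the comments, sketched here, is to join the pieces from an array of strings. The `parts` array and the variable names are illustrative and not part of this answer; `String.raw` is used so the backslashes do not need doubling.

```javascript
const parts = [
    String.raw`(\w+)`,              // (1) attribute name
    String.raw`\s*=\s*`,            // equals sign with optional whitespace
    String.raw`(["'])`,             // (2) opening quote, ' or "
    "(",                            // (3 start) value
    String.raw`[^"'\\]*`,           //   run of chars that are not ", ' or \
    "(?:",
    String.raw`(?:\\[\S\s]`,        //     escape + any char
    String.raw`|(?!\2|\\)[\S\s])`,  //     or one char that is neither the value quote nor \
    String.raw`[^"'\\]*`,           //     followed by another plain run
    ")*",
    ")",                            // (3 end)
    String.raw`\2`,                 // closing quote must match the opening one
];
const re = new RegExp(parts.join(""), "g");

const m = re.exec(String.raw`attr2='val2a \'hello\' "yo" val2b'`);
console.log(m[1] + " => " + m[3]);
```

The joined pattern has the same source as the compact one-line regex, so either form can be used in the loop.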

var str = "attr1=\"\\'val\\'\\\"1\\\"\" attr2='val2a \\'hello\\' \\\"yo\\\" val2b'\n" +
"attr3=\"val\" attr4=\"val\\\"2a\\\" val2b\"\n";

console.log( str );

var re = /(\w+)\s*=\s*(["'])([^"'\\]*(?:(?:\\[\S\s]|(?!\2|\\)[\S\s])[^"'\\]*)*)\2/g;

var m; // declared explicitly to avoid an implicit global
while ((m = re.exec(str)) !== null) {
    if (m.index === re.lastIndex)
        re.lastIndex++;

    var atr = m[1];
    var val = m[3];
    // Remove escapes if needed
    val = val.replace(/([^\\'"]|(?=\\["']))((?:\\\\)*)\\(["'])/g, "$1$2$3");

    console.log( atr + " => " + val );
}