Write regex to match markup element like attribute/values even ones not wrapped in quotes

Question

Say I have

<div class="doublequotes"></div>
<div class='simplequotes'></div>
<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>

I would like to see a nice regex to grab all attribute/value pairs above as follows:

class, doublequotes
class, simplequotes
data-attr-1, no quotes
data-attr-2, again no quotes

Please note in the setup the following

presence of both single/double quotes to wrap values
possible absence of any quote
possible absence of any quote + multiple-word value

What I guess could be used are some positive/negative look-ahead/behind(s). — Michael, Oct 13 '16 at 17:09
Great advice, Barmar, only that I need to parse the content where there is no DOM. Imagine template engines, and others along this line. — Michael, Oct 13 '16 at 17:14
Regex is a bad tool for parsing HTML, and even worse for something as variable as this quote/no-quote mix. On the other hand, any HTML parser would read this as `data-attr-1="no"` and `data-attr-2="again"`, with `quotes`, `no` and `quotes` (again) as separate attributes equal to the empty string. Some parsers out there allow the input to be a fragment and don't need a root. — user1040495, Oct 13 '16 at 17:18
You can't have multiple word values without quotes. `data-attr-1=no quotes` is two attributes: `data-attr-1` with value `no` and `quotes` with no value. — Barmar, Oct 13 '16 at 17:19
See http://stackoverflow.com/questions/1380041/regex-for-html-attributes-in-php for a similar question. — Barmar, Oct 13 '16 at 17:19
Thanks for the reference, Barmar! The thing is, I managed to cover the basic/standard cases (quotes present, or no quotes with a single-word value). Handling the non-standard ones seems tricky, though. — Michael, Oct 13 '16 at 17:50
McCaughan, I would like people to give constructive feedback as well. Here is my attempt so far, if it is that important to you: <**\s*a\s*href\s*=\s*(("([^">]*)")|('([^'>]*)')|((?!(\s*[^'"\s]*\=))[^'"=>]*)) — Michael, Oct 13 '16 at 17:54
Parsing HTML with a regex is a **very [bad idea](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)**. Use a DOM parser in your favorite programming language. Also, HTML has required spaced attributes to be quoted since HTML v3 or thereabouts. DOM parsers will correctly extract `data-attr-1=no` and a standalone attribute of `quotes`, etc. — Adam Katz, Oct 13 '16 at 17:59
Adam, it is not simple HTML, it is an extented, custom, markup language. — Michael, Oct 13 '16 at 18:01

Michael · Answer 1 · 2016-10-13T20:52:53.017

After more than a few tweaks, I have managed to build something

([0-9a-zA-Z-]+)\s*=\s*(("([^">]*)")|('([^'>]*)')|(([^'"=>\s]+\s)\s*(?![^\s]*=))*)?

This should deal reasonably even with something like

 <t key1="value1" key2='value2' key3 = value3 key4 = v a l u e 4 key5 = v a l u e 5 />
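As a rough illustration, here is how the pattern could be exercised in JavaScript. This is only a sketch: it writes the negated class with an ASCII `^` and the letter range as `A-Z`, which I assume is what was intended, and the helper name `attrPairs` is made up for the example. Unquoted values come back with trailing whitespace, so the sketch trims them.

```javascript
// Sketch only: the pattern from above, with an ASCII ^ and A-Z assumed.
const pairRe = /([0-9a-zA-Z-]+)\s*=\s*(("([^">]*)")|('([^'>]*)')|(([^'"=>\s]+\s)\s*(?![^\s]*=))*)?/g;

// attrPairs is a hypothetical helper name, not part of any library.
function attrPairs(text) {
  const pairs = [];
  for (const m of text.matchAll(pairRe)) {
    // Group 4 = double-quoted value, group 6 = single-quoted value,
    // group 2 = the whole value (used here for the unquoted case).
    const value = m[4] ?? m[6] ?? (m[2] ?? "").trim();
    pairs.push([m[1], value]);
  }
  return pairs;
}

const sample = "<t key1=\"value1\" key2='value2' key3 = value3 key4 = v a l u e 4 key5 = v a l u e 5 />";
console.log(attrPairs(sample));
```

One limitation worth knowing: the unquoted branch requires whitespace after every word, so a last word that touches `/>` or `>` directly (as in `quotes/>`) would not be captured.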

edited Oct 13 '16 at 20:52

answered Oct 13 '16 at 20:41

Adam Katz · Accepted Answer · 2016-10-17T17:42:58.043

Here is a solution, written in JavaScript so you can try it out right here. It first splits the input into tags and then into attributes, which lets you keep the parent tag (if you don't want that, don't use tag[1]).

A main reason this extracts tags and then attributes is so we don't find false "attributes" outside the tags. Note how the look="a distraction" part is not included in the parsed output.

<textarea id="test" style="width:100%;height:11ex">
<div class="doublequotes"> look="a distraction" </div><div class='simplequotes'></div>
<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>
<t key1="value1" key2='value2' key3 = value3 key4 = v a l u e 4 key5 = v a l u e 5 />
Poorly nested 1 (staggered tags): <a1 b1=c1>foo<d1 e1=f1>bar</a1>baz</d1>
Poorly nested 2 (nested tags): <a2 b2=c2 <d2 e2=f2>>
</textarea>

<script type="text/javascript">
 function parse() {
  var xml = document.getElementById("test").value;  // grab the above text
  var out = "";                                     // assemble the output

  var tag_re = /<([^\s>]+)(\s[^>]*\s*\/?>)/g;  // each tag as (name) and (attrs)
  // each attribute, leaving room for future attributes
  var attr_re = /([^\s=]+)\s*=\s*("[^"]*"|'[^']*'|[^'"=\/>]*?[^\s\/>](?=\s+\S+\s*=|\s*\/?>))/g;
  var tag, attr;                            // declared locally, not as implicit globals

  while ((tag = tag_re.exec(xml))) {          // for each tag
    while ((attr = attr_re.exec(tag[2]))) {   // for each attribute in each tag
      out += "\n" + tag[1] + " -&gt; " + attr[1] + " -&gt; "
          + attr[2].replace(/^(['"])(.*)\1$/,"$2");  // remove quotes
    }
  }

  document.getElementById("output").innerHTML = out.replace(/</g,"&lt;");

 }

</script>

<button onclick="parse()" style="float:right;margin:0">Parse</button>
<pre id="output" style="display:table"></pre>

I am not sure how complete this is since you haven't explicitly stated what is and is not valid. The comments to the question already establish that this is neither HTML nor XML.
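Since the question mentions there is no DOM available, the same two-pass idea could also be wrapped in a plain function that runs outside a browser. The following is a minimal sketch under that assumption; the regexes are the ones from the snippet, while the function name `extractAttributes` and the shape of the returned objects are invented for illustration.

```javascript
// Same two regexes as in the snippet above; everything else is illustrative.
const tagRe = /<([^\s>]+)(\s[^>]*\s*\/?>)/g;
const attrRe = /([^\s=]+)\s*=\s*("[^"]*"|'[^']*'|[^'"=\/>]*?[^\s\/>](?=\s+\S+\s*=|\s*\/?>))/g;

// extractAttributes is a hypothetical name chosen for this sketch.
function extractAttributes(text) {
  const found = [];
  for (const tag of text.matchAll(tagRe)) {         // pass 1: each tag
    for (const attr of tag[2].matchAll(attrRe)) {   // pass 2: its attributes
      found.push({
        tag: tag[1],
        name: attr[1],
        value: attr[2].replace(/^(['"])(.*)\1$/, "$2"), // strip matching quotes
      });
    }
  }
  return found;
}

const input = [
  "<div class=\"doublequotes\"> look=\"a distraction\" </div>",
  "<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>",
].join("\n");

for (const a of extractAttributes(input)) {
  console.log(a.tag + " -> " + a.name + " -> " + a.value);
}
```

Because attributes are only searched inside the text captured for each tag, the `look="a distraction"` part between the tags is never seen by the attribute regex.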

Update: I added two nesting tests, both of which are invalid in XHTML, as an attempt to answer the comment about imbricated elements. This code does not recognize <d2 as a new element because it is inside another element, so it is assumed to be part of the value of the b2 attribute. Because this input includes < and > characters, I had to HTML-escape the <s before rendering the output into the <pre> tag (this is the final replace() call).
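To make the nested case concrete, here is a small sketch that runs only the tag regex from the snippet over the "Poorly nested 2" line and then applies the same `<` escaping. The variable names are chosen for this example only.

```javascript
// The tag regex from the snippet, applied to the "Poorly nested 2" input.
const tagPattern = /<([^\s>]+)(\s[^>]*\s*\/?>)/g;
const nested = "<a2 b2=c2 <d2 e2=f2>>";

// Collect every tag match as { name, attrs }.
const tags = Array.from(nested.matchAll(tagPattern), m => ({ name: m[1], attrs: m[2] }));

// Only one tag is found: the inner "<d2" stays inside a2's attribute text.
console.log(tags.length, tags[0].name);

// Escape "<" so the text is shown literally when assigned to innerHTML.
const safe = tags[0].attrs.replace(/</g, "&lt;");
console.log(safe);
```

The regex consumes everything up to the first `>`, which is why the inner element never gets a match of its own.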

More complete solution than what I've posted, so it deserves to be accepted. — Michael, Oct 16 '16 at 06:52
Though, since you tackle tags as well, would this properly handle imbricated elements? — Michael, Oct 16 '16 at 06:54
By "imbricated," do you mean staggered, like `<a b=c>foo<d e=f>bar</a>baz</d>`? Since this code doesn't deal with nested elements, that'll result in `a -> b -> c` and `d -> e -> f`. Otherwise, I do not understand your question. — Adam Katz, Oct 16 '16 at 18:24
I was thinking of a next thing to do - managing tag text/content. — Michael, Oct 17 '16 at 20:34
I made the textbox editable so you can test things on your own. Just press "run code snippet" and then modify the text box. You can keep pressing "parse" to see the output of your modified text. The example you gave just now doesn't have attributes, so it won't return anything. `[^>]` is indeed in the regex, but that shouldn't be a problem. If you need a `>` inside an attribute's value, you'll have to quote it. — Adam Katz, Oct 17 '16 at 20:34
