Say I have

<div class="doublequotes"></div>
<div class='simplequotes'></div>
<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>

I would like to see a nice regex that grabs all attribute/value pairs above, as follows:

  • class, doublequotes
  • class, simplequotes
  • data-attr-1, no quotes
  • data-attr-2, again no quotes

Please note in the setup the following

  • presence of both single/double quotes to wrap values
  • possible absence of any quote
  • possible absence of any quote + multiple-word value
Michael
  • What I guess could be used are some positive/negative look-ahead/behind(s). – Michael Oct 13 '16 at 17:09
  • Don't use regexp to parse HTML, use a DOM parser library. – Barmar Oct 13 '16 at 17:12
  • Great advice, Barmar, only that I need to parse the content where there is no DOM. Imagine template engines, and others along this line. – Michael Oct 13 '16 at 17:14
  • Regexp is bad at parsing HTML, and even worse at parsing highly variable input like these quoted/unquoted values. On the other hand, any HTML parser would read `data-attr-1="no"` and `data-attr-2="again"`, and treat `quotes`, `no` and `quotes` (again) as attributes equal to the empty string. Some parsers out there allow the input to be a fragment and don't need a root. – user1040495 Oct 13 '16 at 17:18
  • You can't have multiple word values without quotes. `data-attr-1=no quotes` is two attributes: `data-attr-1` with value `no` and `quotes` with no value. – Barmar Oct 13 '16 at 17:19
  • See http://stackoverflow.com/questions/1380041/regex-for-html-attributes-in-php for a similar question. – Barmar Oct 13 '16 at 17:19
  • Thanks for the reference, Barmar! The thing is, I managed to cover the basic/standard cases (quotes present, or no quotes with a single, space-free value). Handling the non-standard ones seems tricky, though. – Michael Oct 13 '16 at 17:50
  • McCaughan, I would like people to give constructive feedback as well. Here is my attempt so far, if it is that important to you: <**\s*a\s*href\s*=\s*(("([^">]*)")|('([^'>]*)')|((?!(\s*[^'"\s]*\=))[^'"=>]*)) – Michael Oct 13 '16 at 17:54
  • Parsing HTML with a regex is a **very [bad idea](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)**. Use a DOM parser in your favorite programming language. Also, HTML has required spaced attributes to be quoted since HTML v3 or thereabouts. DOM parsers will correctly extract `data-attr-1=no` and a standalone attribute of `quotes`, etc. – Adam Katz Oct 13 '16 at 17:59
  • Adam, it is not simple HTML, it is an extended, custom markup language. – Michael Oct 13 '16 at 18:01
  • Then do not call it HTML. – Adam Katz Oct 13 '16 at 18:03
  • That is fair enough - my bad. Will update. Thank you, sir. – Michael Oct 13 '16 at 18:04

2 Answers

After more than a few tweaks, I have managed to build something

([0-9a-zA-Z-]+)\s*=\s*(("([^">]*)")|('([^'>]*)')|(([^'"=>\s]+\s)\s*(?![^\s]*=))*)?

This should deal reasonably even with something like

 <t key1="value1" key2='value2' key3 = value3 key4 = v a l u e 4 key5 = v a l u e 5 />
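
To try the pattern outside a regex tester, here is a minimal JavaScript sketch. It writes the pattern with a plain ASCII `^` in the lookahead and `A-Z` in the name class, and both of those spellings are assumptions about the intended pattern. `extractPairs` is a hypothetical helper name used only for illustration.

```javascript
// Sketch only: apply the attribute pattern globally and collect [name, value] pairs.
// extractPairs is a hypothetical helper, not part of the original answer.
const ATTR_RE = /([0-9a-zA-Z-]+)\s*=\s*(("([^">]*)")|('([^'>]*)')|(([^'"=>\s]+\s)\s*(?![^\s]*=))*)?/g;

function extractPairs(markup) {
  const pairs = [];
  for (const m of markup.matchAll(ATTR_RE)) {
    // Group 4 holds a double-quoted body, group 6 a single-quoted body.
    // Otherwise fall back to group 2, which for unquoted values keeps
    // a trailing space, hence the trim().
    const value = m[4] ?? m[6] ?? (m[2] ?? "").trim();
    pairs.push([m[1], value]);
  }
  return pairs;
}

const sample = `<t key1="value1" key2='value2' key3 = value3 key4 = v a l u e 4 key5 = v a l u e 5 />`;
console.log(extractPairs(sample));
```

One limitation follows from the `[^'"=>\s]+\s` part: every unquoted word has to be followed by whitespace, so a value that directly abuts `/>` (as in `quotes/>` from the question) loses its last word. The sample above avoids this by leaving a space before `/>`.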
Michael

Here is a solution, written in JavaScript so you can try it out right here. It first separates the input into tags and then into attributes, which allows retaining the parent tag (if you don't want that, don't use tag[1]).

A main reason this extracts tags and then attributes is so we don't find false "attributes" outside the tags. Note how the look="a distraction" part is not included in the parsed output.

<textarea id="test" style="width:100%;height:11ex">
<div class="doublequotes"> look="a distraction" </div><div class='simplequotes'></div>
<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>
<t key1="value1" key2='value2' key3 = value3 key4 = v a l u e 4 key5 = v a l u e 5 />
Poorly nested 1 (staggered tags): <a1 b1=c1>foo<d1 e1=f1>bar</a1>baz</d1>
Poorly nested 2 (nested tags): <a2 b2=c2 <d2 e2=f2>>
</textarea>

<script type="text/javascript">
 function parse() {
  var xml = document.getElementById("test").value;  // grab the above text
  var out = "";                                     // assemble the output

  // each tag as (name) and (attrs)
  var tag_re = /<([^\s>]+)(\s[^>]*\s*\/?>)/g;
  // each attribute, leaving room for future attributes
  var attr_re = /([^\s=]+)\s*=\s*("[^"]*"|'[^']*'|[^'"=\/>]*?[^\s\/>](?=\s+\S+\s*=|\s*\/?>))/g;
  var tag, attr;  // declared locally so they do not leak as implicit globals

  while ((tag = tag_re.exec(xml))) {          // for each tag
    while ((attr = attr_re.exec(tag[2]))) {   // for each attribute in each tag
      out += "\n" + tag[1] + " -&gt; " + attr[1] + " -&gt; "
          + attr[2].replace(/^(['"])(.*)\1$/,"$2");  // remove quotes
    }
  }

  document.getElementById("output").innerHTML = out.replace(/</g,"&lt;");

 }

</script>

<button onclick="parse()" style="float:right;margin:0">Parse</button>
<pre id="output" style="display:table"></pre>

I am not sure how complete this is since you haven't explicitly stated what is and is not valid. The comments to the question already establish that this is neither HTML nor XML.

Update: I added two nesting tests, both of which are invalid in XHTML, in an attempt to answer the comment about imbricated elements. This code does not recognize <d2 as a new element because it sits inside another element's tag and is therefore assumed to be part of the value of the b2 attribute. Because this input includes < and > characters, I had to HTML-escape the <s before rendering them into the <pre> tag (this is the final replace() call).
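
Since the comments mention contexts with no DOM (template engines and the like), the same two-stage idea can also be sketched as a plain function that returns data instead of writing into the page. The two patterns are copied from the snippet above; `parseAttributes` and the `{ tag, name, value }` row shape are hypothetical names chosen for this illustration.

```javascript
// Sketch only: stage 1 finds each tag, stage 2 finds attributes inside that tag.
// parseAttributes and the row shape are assumptions made for this example.
const TAG_RE = /<([^\s>]+)(\s[^>]*\s*\/?>)/g;
const ATTR_RE = /([^\s=]+)\s*=\s*("[^"]*"|'[^']*'|[^'"=\/>]*?[^\s\/>](?=\s+\S+\s*=|\s*\/?>))/g;

function parseAttributes(text) {
  const rows = [];
  for (const tag of text.matchAll(TAG_RE)) {          // tag[1] = name, tag[2] = attribute text
    for (const attr of tag[2].matchAll(ATTR_RE)) {    // attr[1] = name, attr[2] = raw value
      rows.push({
        tag: tag[1],
        name: attr[1],
        value: attr[2].replace(/^(['"])(.*)\1$/, "$2"), // strip matching outer quotes
      });
    }
  }
  return rows;
}

const input = `<div class="doublequotes"> look="a distraction" </div>
<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>`;
for (const row of parseAttributes(input)) {
  console.log(`${row.tag} -> ${row.name} -> ${row.value}`);
}
```

Using `matchAll` rather than `exec` in a loop avoids shared `lastIndex` state between calls. Only text inside a tag ever reaches the attribute pattern, which is why the stray `look="a distraction"` between tags is skipped.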

Adam Katz
  • More complete solution than what I've posted, so it deserves to be accepted. – Michael Oct 16 '16 at 06:52
  • Though, since you tackle tags as well, would this properly handle imbricated elements? – Michael Oct 16 '16 at 06:54
  • By "imbricated," do you mean staggered, like `<a b=c>foo<d e=f>bar</a>baz</d>`? Since this code doesn't deal with nested elements, that'll result in `a -> b -> c` and `d -> e -> f`. Otherwise, I do not understand your question. – Adam Katz Oct 16 '16 at 18:24
  • I was thinking of a next thing to do - managing tag text/content. – Michael Oct 17 '16 at 20:34
  • I made the textbox editable so you can test things on your own. Just press "run code snippet" and then modify the text box. You can keep pressing "parse" to see the output of your modified text. The example you gave just now doesn't have attributes, so it won't return anything. `[^>]` is indeed in the regex, but that shouldn't be a problem. If you need a `>` inside an attribute's value, you'll have to quote it. – Adam Katz Oct 17 '16 at 20:34