Here is a solution, written in JavaScript so you can try it out right here. It first separates the input into tags and then each tag into attributes, which lets it retain the parent tag name (if you don't want that, don't use `tag[1]`).
A main reason this extracts tags first and attributes second is so that we don't find false "attributes" outside the tags. Note how the `look="a distraction"` part is not included in the parsed output.
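To illustrate that point, here is a minimal sketch (not part of the snippet itself) that applies the same attribute pattern directly to the raw text, skipping the tag pass. The names `attrRe`, `input`, and `names` are only for illustration.

```javascript
// Same attribute pattern as in the snippet, applied to the whole input at once.
const attrRe = /([^\s=]+)\s*=\s*("[^"]*"|'[^']*'|[^'"=\/>]*?[^\s\/>](?=\s+\S+\s*=|\s*\/?>))/g;
const input = '<div class="doublequotes"> look="a distraction" </div>';

// Collect every attribute-like name the pattern finds, inside a tag or not.
const names = [];
let m;
while ((m = attrRe.exec(input)) !== null) {
  names.push(m[1]);
}
console.log(names); // "look" is picked up alongside "class"
```

Without the tag pass, `look` is reported as if it were an attribute, which is exactly the false positive the two-step approach avoids.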
<textarea id="test" style="width:100%;height:11ex">
<div class="doublequotes"> look="a distraction" </div><div class='simplequotes'></div>
<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>
<t key1="value1" key2='value2' key3 = value3 key4 = v a l u e 4 key5 = v a l u e 5 />
Poorly nested 1 (staggered tags): <a1 b1=c1>foo<d1 e1=f1>bar</a1>baz</d1>
Poorly nested 2 (nested tags): <a2 b2=c2 <d2 e2=f2>>
</textarea>
<script type="text/javascript">
function parse() {
  var xml = document.getElementById("test").value; // grab the above text
  var out = ""; // assemble the output
  var tag, attr; // current matches, declared here to avoid implicit globals
  var tag_re = /<([^\s>]+)(\s[^>]*\s*\/?>)/g; // each tag as (name) and (attrs)
  // each attribute, leaving room for future attributes
  var attr_re = /([^\s=]+)\s*=\s*("[^"]*"|'[^']*'|[^'"=\/>]*?[^\s\/>](?=\s+\S+\s*=|\s*\/?>))/g;
  while (tag = tag_re.exec(xml)) { // for each tag
    while (attr = attr_re.exec(tag[2])) { // for each attribute in each tag
      out += "\n" + tag[1] + " -> " + attr[1] + " -> "
        + attr[2].replace(/^(['"])(.*)\1$/, "$2"); // remove quotes
    }
  }
  // HTML-escape "<" so tag-like values render as text inside the <pre>
  document.getElementById("output").innerHTML = out.replace(/</g, "&lt;");
}
</script>
<button onclick="parse()" style="float:right;margin:0">Parse</button>
<pre id="output" style="display:table"></pre>
I am not sure how complete this is since you haven't explicitly stated what is and is not valid. The comments to the question already establish that this is neither HTML nor XML.
Update: I added two nesting tests, both of which are invalid in XHTML, in an attempt to answer the comment about imbricated (nested) elements. This code does not recognize `<d2` as a new element because it sits inside another element and is therefore assumed to be part of the value of the `b2` attribute. Because that value includes `<` and `>` characters, I had to HTML-escape the `<`s before rendering the output to the `<pre>` tag (this is the final `replace()` call).
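If you want to check this behavior outside the browser, the following is a rough DOM-free sketch that uses the same two regular expressions. The function name `parseAttributes` and its array-of-triples return shape are my own assumptions for illustration, not part of the snippet above.

```javascript
// Hypothetical helper: same two-pass idea as the snippet, without the DOM.
// Returns an array of [tagName, attributeName, attributeValue] triples.
function parseAttributes(text) {
  const tagRe = /<([^\s>]+)(\s[^>]*\s*\/?>)/g; // each tag as (name) and (attrs)
  const attrRe = /([^\s=]+)\s*=\s*("[^"]*"|'[^']*'|[^'"=\/>]*?[^\s\/>](?=\s+\S+\s*=|\s*\/?>))/g;
  const rows = [];
  let tag, attr;
  while ((tag = tagRe.exec(text)) !== null) {        // first pass: tags
    while ((attr = attrRe.exec(tag[2])) !== null) {  // second pass: attributes
      const value = attr[2].replace(/^(['"])(.*)\1$/, "$2"); // strip quotes
      rows.push([tag[1], attr[1], value]);
    }
  }
  return rows;
}

// The "nested tags" test case from the textarea.
const nested = parseAttributes("<a2 b2=c2 <d2 e2=f2>>");
console.log(nested);
```

For the nested case this yields `b2` with the value `c2 <d2`, followed by `e2` with the value `f2`, both attributed to `a2`. Because the result is plain data, no HTML escaping is needed until you decide how to display it.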