EDIT: I just ended up using Flash's XML capabilities to read the HTML. No need for RegExp selectors!
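For anyone landing here later, the XML route can be sketched roughly as below. This is a minimal illustration rather than the exact code used: it assumes the markup is well-formed enough to parse as XML, and the `root` wrapper element and the variable names (`html`, `doc`, `node`, `attr`) are made up for the example.

```actionscript
// Sketch only: assumes the markup is well-formed XML (closed tags, quoted attributes).
var html:String = "<p>Hi!</p><span>Hi!</span><table><tbody><tr><td>Hi!</td></tr></tbody></table><img src=\"nice.jpg\" />";

// Wrap the fragment in a single root element ("root" is a hypothetical name)
// so that it parses as one XML document.
var doc:XML = new XML("<root>" + html + "</root>");

// elements() returns only the direct child elements, i.e. the top-level tags.
for each (var node:XML in doc.elements()) {
    trace("tag: " + node.name());
    // attributes() lists the attributes of the current top-level tag.
    for each (var attr:XML in node.attributes()) {
        trace("  attribute: " + attr.name() + " = " + attr);
    }
    // children() holds the tag's contents, nested markup included.
    trace("  contents: " + node.children().toXMLString());
}
```

Because the parser tracks nesting itself, nested tags only show up inside the contents of their top-level parent. HTML that is not valid XML (unclosed tags, unquoted attributes) would need cleaning up before this works.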
Here is my original ActionScript:
var evaluatedInput:RegExp = new RegExp('<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>', 'gi');
// Keep the input in one variable so every exec() call scans the same string.
var input:String = "<p>Hi!</p><span>Hi!</span><table><tbody><tr><td>Hi!</td></tr></tbody></table><img src=\"nice.jpg\" />";
var result:Object = evaluatedInput.exec(input);
while (result != null) {
    trace(result);
    // With the 'g' flag, exec() resumes from lastIndex, so each call returns the next match.
    result = evaluatedInput.exec(input);
}
The output window shows the following, which is exactly what I wanted, since only top-level tags are selected:
<p>Hi!</p>,p,Hi!
<span>Hi!</span>,span,Hi!
<table><tbody><tr><td>Hi!</td></tr></tbody></table>,table,<tbody><tr><td>Hi!</td></tr></tbody>
Using the suggested regexp above I get:
<p>,p
</p>,p
<span>,span
</span>,span
<table>,table
<tbody>,tbody
<tr>,tr
<td>,td
</td>,td
</tr>,tr
</tbody>,tbody
</table>,table
<img src="nice.jpg" />,img
So to improve the new regexp I'd like it to:
- Select only top-level HTML tags, not nested ones
- Return the tag and tag attributes of what it just selected
- Return the contents, HTML and all, of the tag it selected
Sorry for the crash list of details. :(