I'm a complete newbie when it comes to regexp, so please bear with me. I would like to create a regular expression that can select all HTML tags. I have the following selector...

/<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/gi

... which works great for tags like this...

<p>Paragraph</p>
<span>Span</span>
<p><a href="link.php">Link</a></p>

... but it can't select tags like this:

<img src="picture.jpg" />

Could someone please show me how to fix the regular expression above so that it selects both styles of HTML tags in one clean move?

Thank you for your time,
spryno724

Oliver Spryn
  • While a direct opposite of http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454, both questions have the same answer. – BoltClock Apr 26 '11 at 17:32
  • Oh, Bolt, I love that post. LOL – omninonsense Apr 26 '11 at 17:42
  • A comedy comment that does nothing to help the user is just plain mean. – tchrist Apr 26 '11 at 19:51
  • It isn't very clear what your goal is. You want to "select all HTML tags" - from where? How will you use them? If you have an HTML file, all tags are contained within the `` and `` tags. Also, your pattern fails when dealing with nested tags: ``. – Kobi Apr 26 '11 at 20:03

2 Answers

Hmm. Okay, so you're looking for something like:

/<\/?([a-z][a-z0-9]*)[^<>]*>/
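
As a rough illustration of how this pattern behaves, here is a minimal sketch. It is written in TypeScript rather than ActionScript (the regex syntax used here should be the same in both, since both follow ECMAScript), and it adds the `g` and `i` flags so the pattern can be applied repeatedly and ignores case. The helper name `listTags` and the sample string are made up for the example.

```typescript
// Matches any single tag: opening, closing, or self-closing.
// Group 1 captures the tag name. The "/" is escaped because it would
// otherwise terminate the regex literal.
const tagPattern = /<\/?([a-z][a-z0-9]*)[^<>]*>/gi;

// Hypothetical helper: returns [fullTag, tagName] pairs in document order.
function listTags(html: string): string[][] {
  const found: string[][] = [];
  tagPattern.lastIndex = 0; // reset, since the "g" flag keeps state between calls
  let m: RegExpExecArray | null = tagPattern.exec(html);
  while (m !== null) {
    found.push([m[0], m[1]]);
    m = tagPattern.exec(html);
  }
  return found;
}

const sample = '<p>Hi!</p><img src="nice.jpg" />';
const tagList = listTags(sample);
console.log(tagList.map((t) => t.join(",")).join("\n"));
// <p>,p
// </p>,p
// <img src="nice.jpg" />,img
```

Note that this matches every tag individually and does not pair opening tags with closing ones. An attribute value that contains a literal `>` would also end the match early.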
omninonsense

EDIT: I just ended up using Flash's XML capabilities to read the HTML. No need for RegExp selectors!

Here is my ActionScript:

// The pattern is built from a string, so each backslash must be doubled.
var evaluatedInput:RegExp = new RegExp('<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>', 'gi');
// Keep the input in one variable so every exec() call scans the same string.
var input:String = "<p>Hi!</p><span>Hi!</span><table><tbody><tr><td>Hi!</td></tr></tbody></table><img src=\"nice.jpg\" />";
var result:Object = evaluatedInput.exec(input);

// With the "g" flag, exec() resumes from lastIndex and returns null once no matches remain.
while (result != null) {
  trace(result);
  result = evaluatedInput.exec(input);
}

The output window shows exactly what I wanted, with only top-level tags selected:

<p>Hi!</p>,p,Hi!
<span>Hi!</span>,span,Hi!
<table><tbody><tr><td>Hi!</td></tr></tbody></table>,table,<tbody><tr><td>Hi!</td></tr></tbody>
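
One caveat, which the comment on the question about nested tags points at: the lazy `(.*?)` followed by the backreference stops at the first matching closing tag, so an element nested inside another element of the same name gets cut short. A minimal sketch of that behaviour, in TypeScript rather than ActionScript and with a made-up input string:

```typescript
// The pattern from the question, written as a regex literal.
const pairPattern = /<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/gi;

// Made-up input: a <div> nested directly inside another <div>.
const nested = "<div><div>inner</div> tail</div>";
const firstPair = pairPattern.exec(nested);

// The lazy (.*?) stops at the first </div>, so the outer element is cut short.
console.log(firstPair ? firstPair[0] : "no match");
// <div><div>inner</div>
```

The sample input above only works because no element contains another element with the same tag name.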

Using the regexp suggested in the other answer, I get:

<p>,p
</p>,p
<span>,span
</span>,span
<table>,table
<tbody>,tbody
<tr>,tr
<td>,td
</td>,td
</tr>,tr
</tbody>,tbody
</table>,table
<img src="nice.jpg" />,img

So to improve the new regexp I'd like it to:

  • Select only top level HTML tags, not nested ones
  • Return the tag and tag attributes of what it just selected
  • Return the contents, HTML and all, of the tag it selected

Sorry for the long list of details. :(
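
For what it's worth, the three requirements above can be sketched without a single all-in-one regexp: use a small pattern to find each tag, and keep a depth counter to decide which tags are top-level. The following is only a sketch, in TypeScript rather than ActionScript. The names `TopLevelElement` and `topLevelElements` are hypothetical, and it assumes well-formed, XHTML-style markup in which every element is either explicitly closed or self-closing (an unclosed `<br>` or `<img>` would throw the depth count off).

```typescript
// Hypothetical result shape: one entry per top-level element.
interface TopLevelElement {
  tag: string;        // tag name, e.g. "table"
  attributes: string; // raw attribute text from the opening tag
  content: string;    // inner HTML; empty for self-closing tags
}

function topLevelElements(html: string): TopLevelElement[] {
  // Groups: 1 = "/" on a closing tag, 2 = tag name, 3 = attribute text,
  // 4 = "/" on a self-closing tag.
  const token = /<(\/?)([a-z][a-z0-9]*)\b([^<>]*?)(\/?)>/gi;
  const results: TopLevelElement[] = [];
  let depth = 0;        // number of currently open elements
  let contentStart = 0; // index just past the open top-level tag
  let current: TopLevelElement | null = null;
  let m: RegExpExecArray | null = token.exec(html);
  while (m !== null) {
    const isClosing = m[1] === "/";
    const isSelfClosing = m[4] === "/";
    if (isClosing) {
      depth = Math.max(0, depth - 1);
      if (depth === 0 && current !== null) {
        // Back at the top level: everything since the opening tag is content.
        current.content = html.slice(contentStart, m.index);
        results.push(current);
        current = null;
      }
    } else if (isSelfClosing) {
      // Self-closing tags never change the depth.
      if (depth === 0) {
        results.push({ tag: m[2], attributes: m[3].trim(), content: "" });
      }
    } else {
      if (depth === 0) {
        current = { tag: m[2], attributes: m[3].trim(), content: "" };
        contentStart = m.index + m[0].length;
      }
      depth++;
    }
    m = token.exec(html);
  }
  return results;
}

const demo = '<p>Hi!</p><span>Hi!</span><table><tbody><tr><td>Hi!</td></tr></tbody></table><img src="nice.jpg" />';
const elements = topLevelElements(demo);
for (const el of elements) {
  console.log(el.tag + " | " + el.attributes + " | " + el.content);
}
```

Because it counts depth instead of relying on a backreference, this also copes with an element nested inside another of the same name. It still ignores comments, CDATA, and attribute values containing `<` or `>`, so a real parser remains the safer choice.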

Oliver Spryn
  • I suggest looking into an XHTML parser, or something. Doing this with regexp would be possible, but really, really unpleasant. – omninonsense Apr 26 '11 at 18:48