HTML Regexp Selector

Question

I'm a real big noobie when it comes to regexp, so please bear with me. I would like to create a regular expression which can select all HTML tags. I have the following selector...

/<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/gi

... which works great for tags like this...

<p>Paragraph</p>
<span>Span</span>
<p><a href="link.php">Link</a></p>

... but it can't select tags like this:

<img src="picture.jpg" />

Could someone please direct me as to how I could fix the regular expression above so that I could select both styles of HTML tags in one clean move?

Thank you for your time,
spryno724

While a direct opposite of http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454, both questions have the same answer. — BoltClock, Apr 26 '11 at 17:32
A comedy comment that does nothing to help the user is just plain mean. — tchrist, Apr 26 '11 at 19:51
It isn't very clear what your goal is. You want to "select all HTML tags" - from where? How will you use them? If you have an HTML file, all tags are contained within the `<html>` and `</html>` tags. Also, your pattern fails when dealing with nested tags: ``. — Kobi, Apr 26 '11 at 20:03

score 1 · Answer 1 · answered Apr 26 '11 at 17:38

Hmm. Okay, so you're looking for something like:

/<\/?([a-z][a-z0-9]*)[^<>]*>/
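
As a quick check, the pattern can be tried in Python's `re` module, which accepts the same syntax for this particular expression. This is only a sketch, not part of the answer itself: the sample string and the names `TAG`, `sample`, and `tags` are assumptions made for illustration, and the `re.IGNORECASE` flag is added here so that upper-case tag names would match too.

```python
import re

# The answer's pattern without the surrounding / delimiters; inside a
# Python raw string the "/" needs no escaping.
# re.IGNORECASE is an addition for this sketch, not part of the answer.
TAG = re.compile(r"</?([a-z][a-z0-9]*)[^<>]*>", re.IGNORECASE)

# Hypothetical input built from the tags shown in the question.
sample = '<p><a href="link.php">Link</a></p><img src="picture.jpg" />'

# finditer() walks the string left to right, yielding one match per tag.
tags = [(m.group(0), m.group(1)) for m in TAG.finditer(sample)]

for whole, name in tags:
    print(whole, name)
```

Each opening, closing, and self-closing tag is matched on its own, so this pattern finds tags but does not pair them up or capture the content between them.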

omninonsense

Hmm... close but it doesn't select the `<img />` tag. :( – Oliver Spryn Apr 26 '11 at 17:42
Uh...yes it does. What language are you using? – josh.trow Apr 26 '11 at 17:44
My bad, it did work, but not quite as expected. I'm using ActionScript 3.0; I'll post the code I'm using below to help out. – Oliver Spryn Apr 26 '11 at 18:29

Oliver Spryn · Accepted Answer · 2011-05-24T17:28:07.160

EDIT: I just ended up using Flash's XML capabilities to read the HTML. No need for RegExp selectors!

Here is my ActionScript:

// The 'g' flag makes exec() resume after the previous match; 'i' ignores case.
var evaluatedInput:RegExp = new RegExp('<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>', 'gi');
var html:String = "<p>Hi!</p><span>Hi!</span><table><tbody><tr><td>Hi!</td></tr></tbody></table><img src=\"nice.jpg\" />";
var result:Object = evaluatedInput.exec(html);

// exec() returns null once there are no more matches.
while (result != null) {
  trace(result);
  result = evaluatedInput.exec(html);
}

The output window shows the following, which is exactly what I wanted, since only top-level tags are selected:

<p>Hi!</p>,p,Hi!
<span>Hi!</span>,span,Hi!
<table><tbody><tr><td>Hi!</td></tr></tbody></table>,table,<tbody><tr><td>Hi!</td></tr></tbody>
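
For readers without ActionScript at hand, the same expression can be tried in Python, where it should behave the same way on these inputs. The sketch below is an illustration rather than part of the original post: `PAIR`, `html`, `matches`, and `nested` are names chosen here, and the nested `<div>` string is a made-up input that shows the nested-tag limitation raised in the comments on the question.

```python
import re

# The question's pattern: an opening tag, lazily captured content, and a
# closing tag whose name must equal group 1 (the \1 backreference).
PAIR = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE)

html = ('<p>Hi!</p><span>Hi!</span>'
        '<table><tbody><tr><td>Hi!</td></tr></tbody></table>'
        '<img src="nice.jpg" />')

# Each match resumes where the previous one ended, much like calling
# exec() repeatedly on a regular expression with the g flag.
matches = [(m.group(1), m.group(2)) for m in PAIR.finditer(html)]

for name, content in matches:
    print(name, content)

# Nested tags with the same name: the lazy (.*?) stops at the first </div>,
# so the outer element is cut short.
nested = PAIR.search("<div><div>inner</div></div>")
print(nested.group(0))
```

Three pairs come back and `<img src="nice.jpg" />` is skipped because it has no closing tag to satisfy the backreference. For the nested `<div>` input the match ends at the first `</div>`, so the result is not a complete element.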

Using the suggested regexp above I get:

<p>,p
</p>,p
<span>,span
</span>,span
<table>,table
<tbody>,tbody
<tr>,tr
<td>,td
</td>,td
</tr>,tr
</tbody>,tbody
</table>,table
<img src="nice.jpg" />,img

So to improve the new regexp I'd like it to:

Select only top level HTML tags, not nested ones
Return the tag and tag attributes of what it just selected
Return the contents, HTML and all, of the tag it selected

Sorry for the crash list of details. :(

I suggest looking into an XHTML parser, or something. Doing this with regexp would be possible, but really, really unpleasant. — omninonsense, Apr 26 '11 at 18:48
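
To make the parser suggestion concrete, here is a minimal sketch of the three requirements listed above (top-level elements only, tag plus attributes, and inner HTML) using Python's standard `html.parser` module. This is a stand-in, not the Flash XML approach mentioned in the edit, and not a complete HTML implementation: `TopLevelCollector`, `top_level_elements`, and the `VOID` set are hypothetical names introduced here, closing tags are rebuilt as `</name>` rather than copied from the source, comments are dropped, and mismatched closing tags are not detected.

```python
from html.parser import HTMLParser

# Elements that take no closing tag (a partial list, enough for this sketch).
VOID = {"img", "br", "hr", "input", "meta", "link"}


class TopLevelCollector(HTMLParser):
    """Collects (tag, attrs, inner_html) for each top-level element."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.depth = 0      # number of elements currently open
        self.results = []   # finished (tag, attrs, inner_html) tuples
        self._open = None   # (tag, attrs) of the top-level element being read
        self._inner = []    # source pieces collected inside that element

    def _add_empty(self, tag, attrs):
        # An element with no content: record it, or keep it as inner source.
        if self.depth == 0:
            self.results.append((tag, dict(attrs), ""))
        else:
            self._inner.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <img ... />.
        self._add_empty(tag, attrs)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self._add_empty(tag, attrs)
            return
        if self.depth == 0:
            self._open, self._inner = (tag, dict(attrs)), []
        else:
            self._inner.append(self.get_starttag_text())
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID or self.depth == 0:
            return  # ignore stray closing tags
        self.depth -= 1
        if self.depth == 0:
            self.results.append((*self._open, "".join(self._inner)))
        else:
            self._inner.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0:
            self._inner.append(data)

    def handle_entityref(self, name):
        self.handle_data(f"&{name};")

    def handle_charref(self, name):
        self.handle_data(f"&#{name};")


def top_level_elements(html):
    """Return a list of (tag, attrs, inner_html) for top-level elements."""
    parser = TopLevelCollector()
    parser.feed(html)
    parser.close()
    return parser.results


sample = ('<p>Hi!</p><span>Hi!</span>'
          '<table><tbody><tr><td>Hi!</td></tr></tbody></table>'
          '<img src="nice.jpg" />')
for tag, attrs, inner in top_level_elements(sample):
    print(tag, attrs, inner)
```

Because the parser tracks nesting depth instead of searching for the next closing tag, nested elements with the same name stay inside their parent, and the self-closing `<img />` is reported along with its attributes.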
