Regular Expression (regex) to Parse HTML Segment

Question

I am currently trying to come up with a regular expression that will parse out something like the following:

ORIGINAL HTML:

<td align="center"><p>line 1</p><p>line 2</p><p>line 3</p></td>

INTENDED HTML:

<td align="center">line 1<br />line 2<br />line 3</td>

Note that there are other <p>...</p> tags throughout the HTML document that must not be touched. I want to replace <p>...</p> only within a <td> or <th>.

I would also need a regex to reverse the process. Please note that these regular expressions have to work in VB/VBScript/Classic ASP, so although I can use lookaheads (which I think are the key here), I cannot use lookbehinds. Some regexes I've tried unsuccessfully are:

1. <td[^>]*>(<p>.+<\/p>)<\/td>
2. <td[^>]*>(<p>.+<\/p>)+?<\/td>
3. <td[^>]*><p>(?:(.+?)<\/p><p>(.+))+<\/p><\/td>
4. <td[^>]*>(<p>(?:(?!<\/p>)).*<\/p>)+?<\/td>
5. <td[^>]*>(?:<p>(.+?)<\/p>)*(?:<p>(.+)<\/p>)<\/td>
6. <td[^>]*>(?:<p>(.+?)<\/p>)(?:<p>(.+)<\/p>)*(?:<p>(.+)<\/p>)<\/td>

I can "cheat" and pull out the entire line and then parse it manually using standard VB string manipulation functions, but that's definitely neither the most elegant nor the fastest way. There has to be some way to do this in one shot using RegEx's.

Eventually I'd like to take...

<td align="center"><p><span style="color:#ff0000;"><strong>line 1</strong></span></p><p>line 2</p><p>line 3</p></td>

...and turn it into

<td align="center"><span style="color:#ff0000;"><strong>line 1</strong></span><br />line 2<br />line 3</td>

Any ideas (besides not to do this with a regex, lol)?

Thank you!

Have you thought of using an HTML parser and apply some DOM operations on it instead? — Gumbo, Jan 18 '11 at 19:59
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Hank Gay, Jan 18 '11 at 20:01
@Hank, that link is not helpful to someone who does not already understand why RegExes cannot parse HTML. — Dour High Arch, Jan 18 '11 at 23:12
@Dour it's an auto-comment created when I voted to close as dupe, which it is. There are *at least* a dozen other dupes explaining why it's impossible and a bad idea, all easily findable, if you'd prefer to close as dupe for a different one. I chose this one in the hopes that the entertaining writing would be enough to convince the OP, or possibly the thousands of upvotes. All my prior efforts seem to have been ineffective. — Hank Gay, Jan 19 '11 at 11:20
I've read through many, perhaps dozens of other questions, answers, and comments regarding this. Most, if not all, explain the issues with a regex and HTML, however, many more people use regex's daily to parse just that, HTML. I couldn't find my example exactly which is why I wrote this question. If my specific case can't be done using a single regex (or even a couple), it's unfortunate, but not the end of the world. I will have to resort to other means. Other people may have more experience with advanced regex's and could've perhaps come up with a workaround, hence the question. — Zycon, Jan 20 '11 at 14:40

score 0 · Answer 1 · answered Jan 18 '11 at 20:09

Regular expressions are not suited to a non-regular language like HTML. You would be better off using a proper HTML parser.

You could use PHP’s DOM library:

$doc = new DOMDocument();
$doc->loadHTML($code);
$xpath = new DOMXPath($doc);
// find all P elements that are a direct child of a TD or TH
foreach ($xpath->query('//td/p | //th/p') as $elem) {
    $parent = $elem->parentNode;
    if ($elem->previousSibling !== null) {          // add BR before any P that is not first in its cell
        $parent->insertBefore($doc->createElement('br'), $elem);
    }
    while ($elem->firstChild !== null) {            // move contents out of P (childNodes is live, so don't foreach over it)
        $parent->insertBefore($elem->firstChild, $elem);
    }
    $parent->removeChild($elem);                    // remove the now-empty P
}

score 0 · Answer 2 · answered Jan 18 '11 at 23:16

Here's your problem:

There has to be some way to do this in one shot using RegEx's.

This is false; there is no way. It's mathematically impossible. Regular expressions, even ones with lookahead, cannot maintain the state required to parse HTML.

You have to use an HTML parser. Many have been written; if you specify your target environment, we can help you select one. For example, in .NET the HTML Agility Pack is good.
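
As a small illustration of the state problem (a sketch added for clarity, not part of the original answer): a pattern cannot count how many cells are currently open, so when a cell contains a nested table, a lazy match simply ends at the first closing tag it meets. The sample string and variable names below are made up for the demonstration, and the script assumes it is run with cscript.exe so that WScript.Echo is available.

```vbscript
Option Explicit

Dim re, html, m
Set re = New RegExp
re.Global = True
re.Pattern = "<td[^>]*>.+?<\/td>"   ' naive "one whole cell" pattern

' Hypothetical input: an outer cell that contains a nested table.
html = "<td>outer<table><tr><td>inner</td></tr></table>tail</td>"

For Each m In re.Execute(html)
    ' The lazy .+? stops at the first </td>, which closes the INNER cell,
    ' so the match is a truncated fragment of the outer cell.
    WScript.Echo m.Value   ' <td>outer<table><tr><td>inner</td>
Next
```

Making the quantifier greedy has the opposite failure: on a row with several cells, `.+` runs to the last `</td>` and swallows all of them as one match.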

Dour High Arch


Unfortunately this page is in Classic ASP, not .NET (...yeah I know), so I can't readily use any of the .NET add-ons. – Zycon Jan 20 '11 at 14:55
@Zycon, does classic ASP support ISAPI filters? You can write a filter that uses an HTML parser to do the translation after ASP generates the page. – Dour High Arch Jan 20 '11 at 21:44

score 0 · Answer 3 · answered Jan 24 '11 at 18:36

ASP (or, more specifically, IIS) does support ISAPI filters; however, I didn't want to resort to them, and didn't have to. The HTML segment is only a string, not part of a DOM tree (although I could have converted it to one if need be).

Ultimately, here's how I resolved the issue since a straight regex apparently cannot do what I want:

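' Assumes RE3 is a RegExp object with Global = True and that "it" holds the HTML string.
' Find each <td> whose content starts with <p> and ends with </p>.
RE3.Pattern = "<td[^>]*><p>.+?<\/p><\/td>"
Set Matches = RE3.Execute(it)
If Matches.Count > 0 Then
   RE3.Pattern = "<p[^>]*>"   ' reuse the object to strip opening <p> tags
   For Each Match In Matches
      ' Drop the <p> tags, turn each </p> into <br />, then remove the <br /> left before </td>.
      itxt_tmp = Replace(Replace(RE3.Replace(Match.Value,""),"</p>","<br />"),"<br /></td>","</td>")
      it = Replace(it,Match.Value,itxt_tmp)
   Next
End If
Set Matches = Nothing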

And to go back to the original:

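' Assumes RE is a RegExp object with Global = True and that itxt holds the converted HTML.
RE.Pattern = "<td[^>]*>.+?<\/td>"
Set Matches = RE.Execute(itxt)
If Matches.Count > 0 Then
   For Each Match In Matches
      ' Only cells that contain a <br /> are turned back into paragraphs.
      If InStr(1,Match.Value,"<br />") > 1 Then
         RE.Pattern = "<td([^>]*)>"   ' capture the attributes so they survive the rewrite
         ' <br /> becomes </p><p>, </td> becomes </p></td>, and the opening tag gains a <p>.
         itxt_tmp = RE.Replace(Replace(Replace(Match.Value,"<br />","</p><p>"),"</td>","</p></td>"),"<td$1><p>")
         itxt = Replace(itxt,Match.Value,itxt_tmp)
      End If
   Next
End If
Set Matches = Nothing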

Probably not the fastest way, nor the best way, but it does the job. I don't know whether this will help anyone else with a similar problem, but I figured I'd toss this code segment out there just in case.
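
For anyone who wants to reuse the idea, here is one way the two conversions might be wrapped into functions that also cover <th>. This is only a sketch under several assumptions: the names NewRegex, CellParasToBreaks and CellBreaksToParas are made up for illustration, tags are lowercase as in the sample, each cell sits on one line (the VBScript `.` does not match line breaks), there are no nested tables, and line breaks are spelled exactly `<br />`. A lookahead keeps a match from running past the end of its own cell.

```vbscript
Option Explicit

' Hypothetical helper: a RegExp that finds every match in the string.
Function NewRegex(pat)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = pat
    Set NewRegex = re
End Function

' <p>...</p> runs inside a td/th become <br />-separated content.
Function CellParasToBreaks(html)
    Dim cellRe, openRe, m, tag, cell, result, pos
    ' (?!<\/t[dh]>) stops the lazy match from crossing into the next cell.
    Set cellRe = NewRegex("<(td|th)\b[^>]*><p\b[^>]*>(?:(?!<\/t[dh]>).)+?<\/p><\/\1>")
    Set openRe = NewRegex("<p\b[^>]*>")
    result = ""
    pos = 1
    For Each m In cellRe.Execute(html)
        tag = m.SubMatches(0)
        cell = openRe.Replace(m.Value, "")          ' strip opening <p> tags
        cell = Replace(cell, "</p>", "<br />")      ' each paragraph end becomes a break
        cell = Replace(cell, "<br /></" & tag & ">", "</" & tag & ">")  ' drop the trailing break
        ' FirstIndex is 0-based, Mid is 1-based: copy the untouched text, then the new cell.
        result = result & Mid(html, pos, m.FirstIndex + 1 - pos) & cell
        pos = m.FirstIndex + 1 + m.Length
    Next
    CellParasToBreaks = result & Mid(html, pos)
End Function

' Reverse direction: cells that contain <br /> are split back into paragraphs.
Function CellBreaksToParas(html)
    Dim cellRe, m, tag, body, cell, result, pos
    Set cellRe = NewRegex("<(td|th)\b([^>]*)>((?:(?!<\/t[dh]>).)*)<\/\1>")
    result = ""
    pos = 1
    For Each m In cellRe.Execute(html)
        tag = m.SubMatches(0)
        body = m.SubMatches(2)
        cell = m.Value                              ' cells without a break stay as they are
        If InStr(body, "<br />") > 0 Then
            cell = "<" & tag & m.SubMatches(1) & "><p>" & _
                   Replace(body, "<br />", "</p><p>") & "</p></" & tag & ">"
        End If
        result = result & Mid(html, pos, m.FirstIndex + 1 - pos) & cell
        pos = m.FirstIndex + 1 + m.Length
    Next
    CellBreaksToParas = result & Mid(html, pos)
End Function

' Round trip on the sample from the question (run with cscript.exe).
Dim sample, flat
sample = "<td align=""center""><p>line 1</p><p>line 2</p><p>line 3</p></td>"
flat = CellParasToBreaks(sample)
WScript.Echo flat                      ' <td align="center">line 1<br />line 2<br />line 3</td>
WScript.Echo CellBreaksToParas(flat)   ' <td align="center"><p>line 1</p><p>line 2</p><p>line 3</p></td>
```

In a Classic ASP page the last lines would use Response.Write instead of WScript.Echo. The round trip is not lossless: a cell holding a single paragraph loses its <p> on the way out and, having no <br />, does not get it back. Rebuilding the output string from match positions also avoids rescanning the whole document with Replace for every match.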
