Regular expression to parse values from somewhat complex HTML table

Question

I have a web page which contains a lot of tables. I cannot change that page, but I need a way to work with its data in a different application, so I need to be able to parse it and extract some data. I am terrible with regular expressions, so I would really appreciate some help with this. I will most likely use the regular expression in a PHP (Laravel) application, if that's relevant to the syntax.

The web page I need to parse contains a lot of these (among other things):

<!-- Post number: 10000 -->
<!-- 127.0.0.1  127.0.0.1 -->
<table class="message" cellspacing="0" cellpadding="0" border="0">
    <tr>
        <td>
            <table cellspacing="0" cellpadding="0" border="0">
                <tr>
                    <td class="tableheader2" nowrap>
                        <B>Name: </B> Firstname Lastname
                    </td>
                    <td class="tableheader2" nowrap>
                        <a href="url.html?param=10000" target="_blank">
                            <img src="image.png" alt="Alt message" border="0">
                        </a>
                        &nbsp;
                        <a href="url2.html?param2=20000">
                            <img src="image2.png" alt="Alt message" border="0">
                        </a>
                        &nbsp;
                    </td>
                    <td class="tableheader2" width="100%">
                        &nbsp;
                    </td>
                </tr>
                <tr>
                    <TD class="tableheader2" WIDTH=520 colspan="3">
                        <b>
                            Sent:  
                        </b>
                        2014-01-01 11:00:00<BR>
                    </td>
                </tr>
            </table>
        </td>
    </tr>
    <tr>
        <td class="tableheader2">
            <table class="tableheader2" CELLSPACING=0 CELLPADDING=0 BORDER=0>
                <tr>
                    <td>
                        &nbsp;
                    </td>
                    <td>
                        Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quos, amet neque non voluptate facilis natus ullam impedit veritatis libero maiores.
                    </td>
                    <td>
                        &nbsp;
                    </td>
                </tr>
            </table>
        </td>
    </tr>
</table>
<hr align="left">

That's just one of many such posts in a long row. I have also edited it a bit (indentation) for readability.

What I need is to be able to parse that entire page and grab all of these elements (I am using their values from the example above, but they could of course be anything):

10000 (from Post number comment)
Firstname Lastname
2014-01-01 11:00:00
Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quos, amet neque non voluptate facilis natus ullam impedit veritatis libero maiores.

Any help with this would be very much appreciated. I would have provided sample code, but none of my own futile attempts are even close, so that would probably only be counterproductive.

**Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. — Andy Lester, Mar 06 '14 at 14:08

Ja͢ck · Accepted Answer · 2014-03-06T15:52:21.083

This kind of thing always involves some guesswork, but DOMDocument can definitely help:

// $html is assumed to hold the page source as a string
$d = new DOMDocument;
$d->loadHTML($html);

$x = new DOMXPath($d);

foreach ($x->query('//table[@class="message"]') as $message) {
    // find preceding comment
    $start = $message->previousSibling;
    while ($start && !preg_match('/Post number:\s*(\d+)/', $start->nodeValue, $match)) {
        $start = $start->previousSibling;
    }
    if ($start === null) {
        continue; // comment not found
    }
    $post = $match[1];
    $name = $sent = null; // reset so values from the previous post do not carry over
    foreach ($x->query('tr[1]//td[@class="tableheader2"]', $message) as $hdr) {
        if (preg_match('/Name:\s*(.*)/', $hdr->nodeValue, $match)) {
            $name = rtrim($match[1]); // found name
        } elseif (preg_match('/Sent:\s*(.*)/', $hdr->nodeValue, $match)) {
            $sent = rtrim($match[1]); // found sent
        }
    }
    // find description from the next row; guard against a missing cell
    $cell = $x->query('tr[2]//table[@class="tableheader2"]/tr/td[2]', $message)->item(0);
    $desc = $cell ? trim($cell->nodeValue) : '';
    echo "Post: $post\nName: $name\nSent: $sent\nDesc: $desc\n";
}
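
A self-contained variant of the loop above, sketched under a few assumptions: the helper name `extractPosts` and the array keys are made up for illustration, `libxml_use_internal_errors()` is added only to keep parser warnings from legacy markup quiet, and the comment lookup uses the XPath `preceding-sibling::comment()` axis in place of the `previousSibling` loop. It has been reasoned through against the sample markup in the question, not against the real page:

```php
<?php
// Hypothetical helper (not part of the original answer):
// returns one associative array per post instead of echoing.
function extractPosts(string $html): array
{
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $posts = [];

    foreach ($xpath->query('//table[@class="message"]') as $message) {
        // Nearest preceding sibling comment that mentions the post number
        $comment = $xpath->query(
            'preceding-sibling::comment()[contains(., "Post number:")][1]',
            $message
        )->item(0);
        if ($comment === null
            || !preg_match('/Post number:\s*(\d+)/', $comment->nodeValue, $m)) {
            continue; // comment not found
        }

        // Fresh values for every post, so nothing carries over
        $post = ['number' => $m[1], 'name' => null, 'sent' => null, 'text' => null];

        // Header cells inside the first row hold the name and the date
        foreach ($xpath->query('tr[1]//td[@class="tableheader2"]', $message) as $cell) {
            if (preg_match('/Name:\s*(.*)/', $cell->nodeValue, $m)) {
                $post['name'] = rtrim($m[1]);
            } elseif (preg_match('/Sent:\s*(.*)/', $cell->nodeValue, $m)) {
                $post['sent'] = rtrim($m[1]);
            }
        }

        // Message body: second cell of the inner table in the second row
        $body = $xpath->query(
            'tr[2]//table[@class="tableheader2"]/tr/td[2]',
            $message
        )->item(0);
        if ($body !== null) {
            $post['text'] = trim($body->nodeValue);
        }

        $posts[] = $post;
    }

    return $posts;
}
```

Calling `extractPosts(file_get_contents('page.html'))` (the file name is a placeholder) would give one array per post, which is probably easier to hand to the rest of a Laravel application than the `echo` output. The comment lookup relies on the comments being siblings of the table, as they are when the markup sits inside `<body>`.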

This was beautiful. Worked right out of the box without any problems. Thanks so much! — nildog, Mar 06 '14 at 14:20
Update: I see now that every post gets the exact same $desc value. The one from the first match. I will have to look into this. So, perfect, besides that ;-) — nildog, Mar 06 '14 at 14:28
I give up - any ideas on how to solve this problem with all posts getting the same description? — nildog, Mar 06 '14 at 15:00
