Regex matching table rows in HTML

Question

Possible Duplicate:
Best methods to parse HTML with PHP

I'm having a bit of trouble matching table rows with preg. Here is my expression:

<TR[a-z\=\"a-z0-9 ]*>([\{\}\(\)\^\=\$\&\.\_\%\#\!\@\=\<\>\:\;\,\~\`\'\*\?\/\+\|\[\]\|\-a-zA-Z0-9À-ÿ\n\r ]*)<\/TR>

As you can see, it tries to match everything in between the TR tags (including all symbols). That part works great; however, when dealing with multiple table rows, it often takes several rows as ONE match, rather than one match per table row:

<TR>
 <TD>test</TD>
</TR>
<TR>
 <TD>test2</TD>
</TR>

yields:

Array
    (
        [0] => <TD>test</TD>
               <TD>test2</TD>
    )

rather than what I want it to:

Array
    (
        [0] => <TD>test</TD>
        [1] => <TD>test2</TD>
    )

I realize that the reason for this is that it matches the symbols (including `<`, `>` and `/`), so the search naturally takes the rest of the rows until it hits the last one.

So basically, I'm wondering if someone can help me add to the expression so that it will exclude anything with "TR" in between the TR tags, so as to prevent it from matching multiple rows.

*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Sep 02 '11 at 21:04
Do you have an option to use a PHP HTML Parser instead of regex? — Chandu, Sep 02 '11 at 21:04
Instead of manual anything: there are ready-made HTML table extraction libraries for PHP. — mario, Sep 02 '11 at 21:09
Use the [PHP DOM](http://php.net/manual/en/book.dom.php) to do this, not regex. Using regex to parse HTML is generally considered a [bad idea](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). A (somewhat entertaining) take on it: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Jared Farrish, Sep 02 '11 at 21:04
It doesn't answer your question, but **don't do this**: `[\{\}\^\=\$\&\.\_\%\#\!\@\=\<\>\:\;\,\~\\`\'\*\?\/\+\|\[\]\|\-a-zA-Z0-9À-ÿ\n\r ]` It's a horrible mess, and you don't need to put backslashes before almost any of those. You only need to escape `[` and `]` and `\ ` and `-` (when not first/last) and `^` (when first). Here's a much easier-to-read version: `[{}()^=$&._%#!@<>:;,~\`'*?/+\[\]|\-a-zA-Z0-9À-ÿ\n\r ]` — Peter Boughton, Sep 04 '11 at 21:52

Bennett McElwee · Answer 1 · 2013-08-31T12:08:31.410

4

Use lazy matching in your regex: <tr.*?</tr>

But as others have mentioned, it's more robust to use a proper parser if you can.
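
A minimal sketch of the lazy-matching idea, applied to the sample rows from the question. The `s` and `i` modifiers, the `[^>]*` attribute part and the capture group are additions that the one-line pattern above does not include: `s` lets `.` cross newlines, and `i` lets `tr` match the uppercase `<TR>` tags.

```php
<?php
// Sample input taken from the question.
$html = "<TR>\n <TD>test</TD>\n</TR>\n<TR>\n <TD>test2</TD>\n</TR>";

// .*? is lazy: it stops at the first </tr> it can reach, not the last one.
// s: dot also matches newlines; i: tag names match case-insensitively.
preg_match_all('~<tr\b[^>]*>(.*?)</tr>~is', $html, $matches);

// $matches[1] holds the captured inner content, one entry per row.
$rows = array_map('trim', $matches[1]);
print_r($rows);
```

With a greedy `.*` in place of `.*?`, the same pattern would return a single match running from the first `<TR>` to the last `</TR>`, which is the behaviour described in the question.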

edited Aug 31 '13 at 12:08

answered Sep 04 '11 at 21:45

Bennett McElwee


I have tried simple html parser and ganon, but both failed on the broken HTML that I have to parse. – Ravi Soni Aug 30 '13 at 07:08
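
For markup that is not well formed, PHP's bundled DOM extension (the "PHP DOM" linked in the question comments) may be worth trying, since the underlying libxml parser attempts to recover from errors. The following is only a sketch: the sample markup is invented for illustration, it assumes the `dom` extension is enabled, and whether it copes with any particular broken page is not guaranteed.

```php
<?php
// Hypothetical sloppy markup: uppercase tags and an unclosed second row.
$html = '<TABLE><TR><TD>test</TD></TR><TR><TD>test2</TD></TABLE>';

// Collect libxml's parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    // Serialize each child node of the row to rebuild its inner HTML.
    $inner = '';
    foreach ($tr->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $rows[] = trim($inner);
}
print_r($rows);
```

The serialized output comes from the parsed tree rather than the original text, so details such as tag-name case may differ from the input.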

score 2 · Answer 2 · answered Sep 02 '11 at 21:11


Try using global search:

preg_match_all("/<td>([^<]+)/", $html, $matches);

answered Sep 02 '11 at 21:11

Kakashi


That almost works, however I need everything in between the tags, not just individual items from the td tags. Instead of just excluding the "<" from the "[^<]" in your expression, would it somehow be possible to exclude the string "TR" or even ""? – user925996 Sep 02 '11 at 21:18
try setting the `sim` flags and replace `td` by `tr` in regex: `/([<]+)/sim` – Kakashi Sep 02 '11 at 21:39
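
On the question of excluding the string "TR" rather than a single character: one possible approach is a negative lookahead checked before every character consumed. The pattern below is a sketch and not something proposed in this thread; the `s` and `i` modifiers are used so that `.` crosses newlines and `tr` matches the uppercase tags in the sample.

```php
<?php
// Sample input taken from the question.
$html = "<TR>\n <TD>test</TD>\n</TR>\n<TR>\n <TD>test2</TD>\n</TR>";

// (?:(?!</?tr\b).)* consumes one character at a time, but only while the
// upcoming text does not start an opening or closing TR tag.
$pattern = '~<tr\b[^>]*>((?:(?!</?tr\b).)*)</tr>~is';
preg_match_all($pattern, $html, $matches);

// $matches[1] holds the inner content of each row.
$rows = array_map('trim', $matches[1]);
print_r($rows);
```

Unlike `[^<]+`, this lets `<TD>` and other tags appear inside the match, and it refuses to run past the start of another row.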
