
Possible Duplicate:
Best methods to parse HTML with PHP

I'm having a bit of trouble matching table rows with preg. Here is my expression:

<TR[a-z\=\"a-z0-9 ]*>([\{\}\(\)\^\=\$\&\.\_\%\#\!\@\=\<\>\:\;\,\~\`\'\*\?\/\+\|\[\]\|\-a-zA-Z0-9À-ÿ\n\r ]*)<\/TR>

As you can see, it tries to match everything between the TR tags (including all symbols). That part works great. However, when dealing with multiple table rows, it often returns several rows as ONE match rather than one match per row:

<TR>
 <TD>test</TD>
</TR>
<TR>
 <TD>test2</TD>
</TR>

yields:

Array
    (
        [0] => <TD>test</TD>
               <TD>test2</TD>
    )

rather than what I want it to:

Array
    (
        [0] => <TD>test</TD>
        [1] => <TD>test2</TD>
    )

I realize the reason is that the character class also matches the symbols in the tags themselves, so the search naturally consumes the rest of the rows until it hits the last closing tag.

So basically, I'm wondering if someone can help me extend the expression so that it excludes anything containing "TR" between the TR tags, to prevent it from matching multiple rows.

user925996
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Sep 02 '11 at 21:04
  • Do you have an option to use a PHP HTML Parser instead of regex? – Chandu Sep 02 '11 at 21:04
  • Instead of doing anything manually: there are ready-made HTML table extraction libraries for PHP. – mario Sep 02 '11 at 21:09
  • Use the [PHP DOM](http://php.net/manual/en/book.dom.php) to do this, not regex. Using regex to parse HTML is generally considered a [bad idea](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). A (somewhat entertaining) take on it: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jared Farrish Sep 02 '11 at 21:04
  • It doesn't answer your question, but **don't do this**: `[\{\}\(\)\^\=\$\&\.\_\%\#\!\@\=\<\>\:\;\,\~\\`\'\*\?\/\+\|\[\]\|\-a-zA-Z0-9À-ÿ\n\r ]` because it's a horrible mess and you don't need backslashes before almost any of those. You only need to escape: `[` and `]` and `\ ` and `-` (when not first/last) and `^` (when first). Here's a much easier-to-read version: `[{}()^=$&._%#!@<>:;,~\`'*?/+\[\]|\-a-zA-Z0-9À-ÿ\n\r ]` – Peter Boughton Sep 04 '11 at 21:52
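
To illustrate the escaping point in the last comment, here is a minimal sketch comparing a heavily escaped character class with a plain one. The `$sample` string and both patterns are trimmed-down, hypothetical versions of the class from the question (the quote, backtick, tilde and accented range are left out to keep the PHP string simple), and `~` is used as the delimiter so that `/` needs no escaping.

```php
<?php
// Hypothetical sample containing most of the punctuation from the question.
$sample = 'a{b}(c)^=$&._%#!@<>:;,*?/+|[x]-Z9';

// Every symbol escaped, as in the question.
$escaped = '~^[\{\}\(\)\^\=\$\&\.\_\%\#\!\@\<\>\:\;\,\*\?\/\+\|\[\]\-a-zA-Z0-9]+$~';

// Only the brackets and the hyphen escaped.
$plain = '~^[{}()^=$&._%#!@<>:;,*?/+|\[\]\-a-zA-Z0-9]+$~';

// Both classes accept the whole sample string.
var_dump(preg_match($escaped, $sample) === 1);
var_dump(preg_match($plain, $sample) === 1);
```

Both calls should report a match, which suggests the extra backslashes add noise without changing what the class accepts.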

2 Answers


Use lazy matching in your regex: <tr.*?</tr>
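
As a rough sketch of how the lazy pattern might be applied to the sample rows from the question: the `i` and `s` modifiers, the `\b` after `tr`, and the `trim` step are my additions, assuming the input uses uppercase tags and spans several lines as shown there.

```php
<?php
// Sample input taken from the question.
$html = "<TR>\n <TD>test</TD>\n</TR>\n<TR>\n <TD>test2</TD>\n</TR>";

// .*? is lazy: it stops at the first </tr> instead of the last one.
// i = case-insensitive (the sample uses <TR>), s = let . match newlines.
preg_match_all('/<tr\b[^>]*>(.*?)<\/tr>/is', $html, $matches);

// Trim the whitespace left around each row's contents.
$rows = array_map('trim', $matches[1]);
print_r($rows);
```

With this input, `$rows` should hold one entry per row rather than a single combined match. Nested tables would still confuse it, since the pattern has no notion of nesting.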

But as others have mentioned, it's more robust to use a proper parser if you can.
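
For comparison, a minimal sketch of the parser route using PHP's bundled DOM extension. Wrapping the rows in a `<table>` element and collecting the trimmed cell text are assumptions made for this illustration, not something the question specifies.

```php
<?php
// Sample rows from the question, wrapped in a <table> for the parser.
$html = "<table><TR>\n <TD>test</TD>\n</TR>\n<TR>\n <TD>test2</TD>\n</TR></table>";

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        // textContent is the cell's text without any markup.
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}
print_r($rows);
```

Each row comes back as its own array of cell strings, so tag case, attributes, and line breaks in the markup stop mattering.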

Bennett McElwee

Try using global search:

// The i modifier makes the match case-insensitive, so uppercase <TD> tags are found too.
preg_match_all("/<td>([^<]+)/i", $html, $matches);

Kakashi
  • That almost works, however I need everything in between the tags, not just individual items from the td tags. Instead of just excluding the "<" from the "[^<]" in your expression, would it somehow be possible to exclude the string "TR" or even ""? – user925996 Sep 02 '11 at 21:18
  • try setting the `sim` flags and replace `td` by `tr` in regex: `/([<]+)/sim` – Kakashi Sep 02 '11 at 21:39
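
The exclusion asked about in the comment above (matching anything except the string "TR") can be sketched with a negative lookahead. This is one possible pattern, not necessarily what either commenter had in mind, and the `trim` step is added here only to tidy the output.

```php
<?php
// Sample input taken from the question.
$html = "<TR>\n <TD>test</TD>\n</TR>\n<TR>\n <TD>test2</TD>\n</TR>";

// (?:(?!<\/?tr\b).)* consumes one character at a time, but only while
// the upcoming text is not the start of a <tr> or </tr> tag.
$pattern = '/<tr\b[^>]*>((?:(?!<\/?tr\b).)*)<\/tr>/is';
preg_match_all($pattern, $html, $matches);

// Trim the whitespace left around each row's contents.
$rows = array_map('trim', $matches[1]);
print_r($rows);
```

Unlike the negated class `[^<]`, which rejects every `<`, the lookahead only rejects a `<` that begins a TR tag, so the `<TD>` tags inside the row are still captured.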