3

I have a table that I am trying to parse:

<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>

Each row is similarly formatted, and I want to split them apart using regular expressions. I have tried everything I can think of, but the pattern always seems to match the whole contents at once.

I've tried patterns like this:

$pattern = ':(<tr>.*</tr>):';
preg_match_all($pattern, $working, $regs2);

but it always maximally takes everything in one go rather than minimally taking it row by row.

This is probably pretty basic, but I just can't seem to get it.

paullb
  • Long story short: Don't use regex to parse HTML. Use a real XML parser. – Rafe Kettler Mar 10 '11 at 14:32
  • That was the first thing I tried, but it didn't parse the HTML and I got nothing out of the parser which is why I resorted to regex – paullb Mar 10 '11 at 14:33
  • If you have to use regexp to parse HTML, then learn about "greedy" and "ungreedy"... and you're right, it's pretty basic – Mark Baker Mar 10 '11 at 14:34
  • Don't. Do. It. Here's why: http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege – fabrik Mar 10 '11 at 14:38

3 Answers

3

You need to make the .* pattern non-greedy by adding a ?. Try .*? as the middle pattern and see if the problem persists.

Really, you shouldn't use regex to parse HTML, but you did ask what was wrong, so...
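
To illustrate the difference, here is a minimal sketch comparing the greedy and non-greedy patterns. The sample input is made up for illustration, and the s modifier (which lets . match newlines) is an addition that assumes rows may span several lines; neither appears in the question.

```php
<?php
// Hypothetical sample input; $working is the variable name used in the question.
$working = "<tr><td>a</td></tr>\n<tr><td>b</td>\n</tr>\n<tr><td>c</td></tr>";

// Greedy: .* runs on to the LAST </tr>, so the whole input is one match.
// The s modifier (an assumption, see above) lets . match newlines as well.
preg_match_all(':(<tr>.*</tr>):s', $working, $greedy);

// Non-greedy: .*? stops at the FIRST </tr> it can reach, giving one match per row.
preg_match_all(':(<tr>.*?</tr>):s', $working, $lazy);

echo count($greedy[1]), "\n"; // prints 1
echo count($lazy[1]), "\n";   // prints 3
```

This only holds up while rows are not nested; a table inside a cell would still confuse the pattern, which is one reason the comments recommend a parser.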

Rafe Kettler
  • Absolutely perfect. (And I really agree that I shouldn't use regex to parse HTML, but other possibilities seem to fail.) – paullb Mar 10 '11 at 14:40
2

In the regex tester I usually use (http://regexpal.com/), it seems to work normally. If it seems too greedy, try adding a ? after the * to calm it down a bit. If you don't want to capture the <tr></tr> tags themselves, move the () to the inside, like <tr>(.*?)</tr>.
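
As a rough sketch of what moving the parentheses changes, the snippet below runs the inner-group pattern on a made-up two-row string. The sample data and the s modifier are assumptions for illustration; $working and $regs2 are the names from the question.

```php
<?php
// Hypothetical sample input, not taken from the question.
$working = '<tr><td>1</td></tr><tr><td>2</td></tr>';

// The group now wraps only the row contents, not the <tr> tags.
preg_match_all(':<tr>(.*?)</tr>:s', $working, $regs2);

// $regs2[0] holds the full matches, tags included.
// $regs2[1] holds just what the parentheses captured.
print_r($regs2[0]); // <tr><td>1</td></tr> and <tr><td>2</td></tr>
print_r($regs2[1]); // <td>1</td> and <td>2</td>
```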

ShaneTheKing
1

Use Simple HTML DOM (http://simplehtmldom.sourceforge.net/); it will make parsing the table quite easy.
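
For comparison, here is a sketch of the same parser-based idea using PHP's built-in DOMDocument instead of Simple HTML DOM, so that it runs without an extra download. This is a swapped-in technique, not the library recommended above, and the sample markup is invented for illustration.

```php
<?php
// Hypothetical sample input, not taken from the question.
$working = '<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>';

// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($working);
libxml_clear_errors();

// Walk every <tr>, then every <td> inside it.
$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

print_r($rows); // two rows: [a, b] and [c]
```

loadHTML is more tolerant of sloppy markup than a strict XML parser, which may matter if an XML parser rejected the page outright.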

dm03514