
I have a huge html document that has different tables with unique table IDs. Something like:

<table class="my_table" id="table_id1">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
  <tr class="even"><td>Line 4</td></tr>
</table>
<table class="my_table" id="table_id2">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
</table>

Is it possible using preg_match to find HTML of all rows of a specific table?

I tried the following code:

preg_match('/<table[^>]*id="table_id2">(<tr[^>]*><td>[^>]*<\/td><\/tr>)+/', $html, $matches); 
//$html variable contains the html.

but it returns the output like:

Array
(
    [0] => Array
        (
            [0] => <table class="my_table" id="table_id2"><tr class="odd"><td>Line 1</td></tr><tr class="even"><td>Line 2</td></tr><tr class="odd"><td>Line 3</td></tr>
        )

    [1] => Array
        (
            [0] => <tr class="odd"><td>Line 3</td></tr>
        )

)

But I need the output like this:

Array
(
    [0] => Array
        (
            [0] => <table class="my_table" id="table_id2"><tr class="odd"><td>Line 1</td></tr><tr class="even"><td>Line 2</td></tr><tr class="odd"><td>Line 3</td></tr>
        )

    [1] => Array
        (
            [0] => <tr class="odd"><td>Line 1</td></tr>
            [1] => <tr class="even"><td>Line 2</td></tr>
            [2] => <tr class="odd"><td>Line 3</td></tr>
        )

)

Is it possible? Please help.

Andy Lester
qasimzee
  • 4
    Is there a reason you're not using DOM or SAX to actually parse the HTML? It would probably be a lot easier and more reliable. – brianmearns Sep 04 '13 at 12:13
  • 2
    **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Sep 04 '13 at 12:26
  • Yeah, I think DOMDocument is a much better solution – qasimzee Sep 04 '13 at 13:59

2 Answers


You should not use regex to parse HTML. PHP has a great tool for that: DOMDocument. With it, you can do many things that are impossible or nearly impossible with regex. Your sample would look like this:

$sHtml = '<table class="my_table" id="table_id1">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
  <tr class="even"><td>Line 4</td></tr>
</table>
<table class="my_table" id="table_id2">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
</table>';

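$rDoc   = new DOMDocument();
$rDoc->loadHTML($sHtml);
$sId    = 'table_id2';
// Find the table by its id:
$rTable = $rDoc->getElementById($sId);
// childNodes would also include whitespace text nodes,
// so select the <tr> elements directly:
foreach ($rTable->getElementsByTagName('tr') as $rRow)
{
   // saveHTML($node) returns the markup of that single node:
   echo $rDoc->saveHTML($rRow), PHP_EOL;
}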
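
If the goal is the array of row strings shown in the question, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: `rowsByXPath` is a made-up name, the id is assumed to contain no double quotes, and the XPath query is an alternative to `getElementById`, not something the original code used.

```php
<?php
// Hypothetical helper: returns the HTML of each <tr> inside the table with the given id.
function rowsByXPath(string $html, string $tableId): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally so imperfect markup does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $tableId contains no double quote characters.
    $nodes = $xpath->query('//table[@id="' . $tableId . '"]//tr');

    $rows = [];
    foreach ($nodes as $node) {
        // Serialize each row element back to an HTML string.
        $rows[] = $doc->saveHTML($node);
    }
    return $rows;
}

$html = '<table class="my_table" id="table_id1">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
</table>
<table class="my_table" id="table_id2">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
</table>';

$rows = rowsByXPath($html, 'table_id2');
print_r($rows);
```

An unknown id simply yields an empty array, so the caller does not need a separate null check.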
Alma Do

Try this. It's very similar to what you had, but I put a non-capturing group around each row, allowed optional leading and trailing whitespace around each row, and wrapped the whole repetition in a single capturing group.

For reference, the regex used is

/<table[^>]*id="table_id2">((?:\s*<tr[^>]*><td>[^>]*<\/td><\/tr>\s*)+)/
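
A repeated capturing group keeps only its last iteration, which is why the original pattern returned just the third row. The pattern above therefore captures all rows as one string. To get one array entry per row, a second pass with `preg_match_all` over that captured string is one option. The sketch below assumes the markup is as regular as the sample (one `<td>` per row, no nested tags); `rowsByRegex` is a hypothetical name, and `[^<]*` is used for the cell text instead of the `[^>]*` in the pattern above.

```php
<?php
// Hypothetical helper: returns the HTML of each row of the table with the given id.
function rowsByRegex(string $html, string $tableId): array
{
    $id = preg_quote($tableId, '/');
    // Step 1: capture the whole run of rows that follows the opening <table> tag.
    $tablePattern = '/<table[^>]*id="' . $id . '"[^>]*>'
                  . '((?:\s*<tr[^>]*><td>[^<]*<\/td><\/tr>\s*)+)/';
    if (!preg_match($tablePattern, $html, $m)) {
        return [];
    }
    // Step 2: split that run into its individual rows.
    preg_match_all('/<tr[^>]*><td>[^<]*<\/td><\/tr>/', $m[1], $rowMatches);
    return $rowMatches[0];
}

$html = '<table class="my_table" id="table_id1">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
</table>
<table class="my_table" id="table_id2">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
</table>';

$rows = rowsByRegex($html, 'table_id2');
print_r($rows);
```

This still breaks as soon as the HTML departs from that shape, so it is best treated as a stopgap for markup you control.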
brianmearns