Finding all consecutive occurrences of a pattern using preg_match after a specific string

Question

I have a huge HTML document that contains several tables, each with a unique table ID. Something like:

<table class="my_table" id="table_id1">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
  <tr class="even"><td>Line 4</td></tr>
</table>
<table class="my_table" id="table_id2">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
</table>

Is it possible to use preg_match to get the HTML of every row of a specific table?

I tried the following code:

preg_match('/<table[^>]*id="table_id2">(<tr[^>]*><td>[^>]*<\/td><\/tr>)+/', $html, $matches); 
//$html variable contains the html.

but it returns the output like:

Array
(
    [0] => Array
        (
            [0] => <table class="my_table" id="table_id2"><tr class="odd"><td>Line 1</td></tr><tr class="even"><td>Line 2</td></tr><tr class="odd"><td>Line 3</td></tr>
        )

    [1] => Array
        (
            [0] => <tr class="odd"><td>Line 3</td></tr>
        )

)

But I need the output like this:

Array
(
    [0] => Array
        (
            [0] => <table class="my_table" id="table_id2"><tr class="odd"><td>Line 1</td></tr><tr class="even"><td>Line 2</td></tr><tr class="odd"><td>Line 3</td></tr>
        )

    [1] => Array
        (
            [0] => <tr class="odd"><td>Line 1</td></tr>
            [1] => <tr class="even"><td>Line 2</td></tr>
            [2] => <tr class="odd"><td>Line 3</td></tr>
        )

)

Is it possible? Please help.

Is there a reason you're not using DOM or SAX to actually parse the HTML? It would probably be a lot easier and more reliable. — brianmearns, Sep 04 '13 at 12:13
**Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. — Andy Lester, Sep 04 '13 at 12:26

score 2 · Accepted Answer · answered Sep 04 '13 at 12:24

You should not use regex to parse HTML. PHP has a great tool for that: DOMDocument. With it, you can do many things that are impossible or nearly impossible with regex. Your sample would look like:

$sHtml = '<table class="my_table" id="table_id1">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
  <tr class="even"><td>Line 4</td></tr>
</table>
<table class="my_table" id="table_id2">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
</table>';

$rDoc   = new DOMDocument();
$rDoc->loadHTML($sHtml);
$sId    = 'table_id2';
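//look up the table by its id (getElementById returns null if it is not found):
$rTable = $rDoc->getElementById($sId);
foreach($rTable->childNodes as $rItem)
{
   //each item is a child node of the table: the <tr> elements,
   //plus the whitespace text nodes between them
   //var_dump($rItem);
}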
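
To get the array of row HTML that the question asks for, the same DOMDocument approach can be extended as in the minimal sketch below. The helper name `getTableRowsHtml` is hypothetical and not part of the answer above. The sketch assumes the table contains no nested tables, because `getElementsByTagName` returns every descendant `<tr>`.

```php
<?php
// Hypothetical helper (not from the original answer): returns the HTML of
// every <tr> inside the table with the given id, or an empty array.
function getTableRowsHtml(string $html, string $tableId): array
{
    $doc = new DOMDocument();
    // Collect parser warnings from imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $table = $doc->getElementById($tableId);
    if ($table === null) {
        return []; // no element with that id
    }

    $rows = [];
    // getElementsByTagName searches all descendants, so whitespace text
    // nodes are skipped and rows inside a <tbody> are found as well.
    foreach ($table->getElementsByTagName('tr') as $row) {
        // saveHTML() with a node argument serializes only that node.
        $rows[] = $doc->saveHTML($row);
    }
    return $rows;
}

$html = '<table class="my_table" id="table_id1">
  <tr class="odd"><td>Line 1</td></tr>
</table>
<table class="my_table" id="table_id2">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
</table>';

$rows = getTableRowsHtml($html, 'table_id2');
print_r($rows);
```

Returning an empty array for an unknown id keeps the caller free of null checks; whether that is the right behaviour depends on how the result is used.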

score 0 · Answer 2 · answered Sep 04 '13 at 12:17

Try this. It's very similar to what you had, but I put a non-capturing group around each row, allowed optional leading and trailing whitespace in each row, and wrapped the whole repetition in a single capturing group.

For reference, the regex used is

/<table[^>]*id="table_id2">((?:\s*<tr[^>]*><td>[^>]*<\/td><\/tr>\s*)+)/

brianmearns

9,581
10
52
79

It just captures all the rows in one match element. – qasimzee Sep 05 '13 at 05:38
Ah, sorry, I missed that. – brianmearns Sep 05 '13 at 10:23
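
As the comment above points out, a single pattern cannot return each row separately: when a capturing group is repeated, PCRE keeps only the text of its last iteration, which is why the original attempt returned only the Line 3 row. If a regex has to be used anyway, one possible workaround is to match in two steps, as in the sketch below. The helper name `matchTableRows` is hypothetical, and the sketch assumes the markup is exactly as regular as in the question (one `<td>` per row, no nested tags in the cell).

```php
<?php
// Hypothetical helper: two-step regex extraction under the assumptions above.
function matchTableRows(string $html, string $tableId): array
{
    $id = preg_quote($tableId, '/');
    // Step 1: capture the whole run of consecutive rows after the opening tag.
    $blockPattern = '/<table[^>]*id="' . $id . '"[^>]*>'
                  . '((?:\s*<tr[^>]*><td>[^<]*<\/td><\/tr>\s*)+)/';
    if (!preg_match($blockPattern, $html, $block)) {
        return []; // table not found, or it has no matching rows
    }
    // Step 2: split that run into individual rows.
    preg_match_all('/<tr[^>]*><td>[^<]*<\/td><\/tr>/', $block[1], $rows);
    return $rows[0];
}

$html = '<table class="my_table" id="table_id1">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
</table>
<table class="my_table" id="table_id2">
  <tr class="odd"><td>Line 1</td></tr>
  <tr class="even"><td>Line 2</td></tr>
  <tr class="odd"><td>Line 3</td></tr>
</table>';

$rows = matchTableRows($html, 'table_id2');
print_r($rows);
```

This remains as fragile as any regex applied to HTML: a change in attribute quoting, an extra cell, or markup inside a cell will make it return fewer rows or none.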
