PHP and preg_match: how to capture data in multi-line HTML?

Question

The goal is to capture Paris, which is enclosed in

<th>City :</th><td>(.)*</td>

Here is the source I have

                <tr>
                    <th>postal code :</th>
                    <td>75012</td>
                </tr>

                <tr>
                    <th>City :</th>
                    <td>Paris</td>
                </tr>

I tried with

/<th>City :</th><td>(.)*</td>/gmi

expression, with no luck. Any ideas?

[Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) — Biffen, Sep 06 '14 at 15:50

hwnd · Answer 1 · 2014-09-06T16:51:53.547

You have a few problems to deal with here.

PHP does not support the g (global) modifier; use preg_match_all() when you need every match. The m (multi-line) modifier only makes ^ and $ match at the beginning and end of each line, and this pattern uses neither anchor. You can remove both modifiers; we don't need them.
You need to account for the whitespace (a newline plus indentation) between the th and td elements.
You are repeating the capturing group, (.)*, so only the last iteration is captured: in this case the letter s in Paris, instead of the entire contents of that td element.
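The second and third points can be checked with a short script. This is a minimal sketch: the `$html` string is a shortened copy of the markup from the question, and the g flag has already been dropped from the original attempt because PHP rejects it.

```php
<?php
// Shortened copy of the question's markup, keeping the newline and
// indentation between </th> and <td>.
$html = "<tr>\n    <th>City :</th>\n    <td>Paris</td>\n</tr>";

// Original attempt without the unsupported g flag: nothing in the pattern
// allows whitespace between </th> and <td>, so there is no match.
$found = preg_match('/<th>City :<\/th><td>(.)*<\/td>/mi', $html, $m);
var_dump($found); // int(0)

// Allowing whitespace makes the pattern match, but the repeated group (.)*
// keeps only its final iteration.
preg_match('~<th>City :</th>\s*<td>(.)*</td>~i', $html, $last);
var_dump($last[1]); // string(1) "s"
```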

For this simple case, the following would be enough:

~<th>City :</th>\s*<td>(.*?)</td>~i

Note: The * operator after the dot (.) means "match any character except a newline, zero or more times". Adding a question mark after the operator (*?) tells the engine to return a non-greedy match, so it stops at the first </td> instead of the last one.
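To see what the non-greedy quantifier changes, here is a small sketch. The one-line `$row` string is a made-up example (not from the question) that puts two cells on the same line so the difference is visible; the second part applies the pattern above to a shortened copy of the question's markup.

```php
<?php
// Hypothetical single-line input with two cells.
$row = '<td>75012</td><td>Paris</td>';

preg_match('~<td>(.*)</td>~', $row, $greedy);  // .* runs to the last </td>
preg_match('~<td>(.*?)</td>~', $row, $lazy);   // .*? stops at the first </td>

echo $greedy[1], "\n"; // 75012</td><td>Paris
echo $lazy[1], "\n";   // 75012

// The suggested pattern applied to the question's markup.
$html = "<tr>\n    <th>City :</th>\n    <td>Paris</td>\n</tr>";
if (preg_match('~<th>City :</th>\s*<td>(.*?)</td>~i', $html, $m)) {
    echo $m[1], "\n";  // Paris
}
```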

However, for parsing HTML in the future, I would recommend using a tool such as DOM.

// loadHTML() is an instance method; calling it statically fails on PHP 8.
$dom = new DOMDocument();
$dom->loadHTML('
     <tr>
      <th>postal code :</th>
      <td>75012</td>
     </tr>
     <tr>
      <th>City :</th>
      <td>Paris</td>
     </tr>
');
$xp = new DOMXPath($dom);
$td = $xp->query('//th[contains(.,"City")]/following-sibling::*[1]');
echo $td->item(0)->nodeValue; //=> "Paris"

Avinash Raj · Answer 2 · 2014-09-06T15:52:52.957

You just need to enable the dotall modifier (s) and put .*? between the </th> and <td> tags, so that it can match the newline and indentation between them. You also need to put the * inside the capturing group; otherwise only the last character of the string Paris is captured.

~<th>City :</th>.*?<td>(.*?)</td>~s
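A quick way to see the effect of the dotall modifier is to run the same pattern with and without the s flag. This is only a sketch; `$html` is a shortened copy of the question's markup, and the `~` delimiters are a choice made here for illustration.

```php
<?php
$html = "<tr>\n    <th>City :</th>\n    <td>Paris</td>\n</tr>";

// Without s, the dot cannot cross the newline after </th>, so nothing matches.
$without = preg_match('~<th>City :</th>.*?<td>(.*?)</td>~', $html, $m1);

// With s (PCRE_DOTALL), the dot also matches newline characters.
$with = preg_match('~<th>City :</th>.*?<td>(.*?)</td>~s', $html, $m2);

var_dump($without); // int(0)
echo $m2[1], "\n";  // Paris
```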

edited Sep 06 '14 at 15:52

answered Sep 06 '14 at 15:47

Avinash Raj

score 0 · Answer 3 · answered Sep 06 '14 at 15:51

This may be slower, but it is easier to reuse more broadly: http://php.net/manual/en/class.domelement.php
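As a rough illustration of that approach, the sketch below walks the rows with DOMElement methods and collects every label/value pair. It assumes the rows sit inside a `<table>` element, which the question's snippet does not show, and the `$fields` array is a name chosen here for illustration.

```php
<?php
// Assumption: the rows from the question are wrapped in a <table>.
$html = '<table>
  <tr><th>postal code :</th><td>75012</td></tr>
  <tr><th>City :</th><td>Paris</td></tr>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$fields = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    // First header cell and first data cell of the row, if present.
    $th = $tr->getElementsByTagName('th')->item(0);
    $td = $tr->getElementsByTagName('td')->item(0);
    if ($th !== null && $td !== null) {
        // Strip surrounding whitespace and the trailing " :" from the label.
        $label = trim(rtrim(trim($th->textContent), ':'));
        $fields[$label] = trim($td->textContent);
    }
}

echo $fields['City'], "\n"; // Paris
```

Collecting all rows into one array means other fields, such as the postal code, can be read the same way without writing a new pattern for each.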

Juraj Carnogursky
