0

The goal is to capture Paris, which is enclosed in

<th>City :</th><td>(.)*</td>

Here is the source I have

                <tr>
                    <th>postal code :</th>
                    <td>75012</td>
                </tr>

                <tr>
                    <th>City :</th>
                    <td>Paris</td>
                </tr>

I tried the expression

/<th>City :</th><td>(.)*</td>/gmi

with no luck. Any ideas?

yarek

3 Answers

2

You have a few problems to deal with here.

  1. PHP does not support the g (global) modifier; use preg_match_all() if you need every match. The m (multi-line) modifier only makes ^ and $ match at the beginning and end of each line, and your pattern uses neither anchor. You can remove both modifiers.

  2. You need to account for whitespace between the th and td elements.

  3. You are repeating the capturing group (.)*, so only the last iteration is captured. In this case the letter s in Paris would be captured instead of the entire contents of that td element.
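To make the third point concrete, here is a minimal sketch comparing a repeated group with a group that contains the quantifier. The variable names ($html, $repeated, $whole) are made up for illustration, and the markup is put on one line so that whitespace does not interfere with the comparison.

```php
<?php
// Illustrative input: the City row on a single line (an assumption for this demo).
$html = '<th>City :</th><td>Paris</td>';

// Repeated group: (.)* overwrites group 1 on every iteration.
preg_match('~<th>City :</th><td>(.)*</td>~i', $html, $repeated);

// Quantifier inside the group: group 1 spans the whole cell.
preg_match('~<th>City :</th><td>(.*)</td>~i', $html, $whole);

echo $repeated[1], "\n"; // s
echo $whole[1], "\n";    // Paris
```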

For this simple case, the following would be enough:

~<th>City :</th>\s*<td>(.*?)</td>~i

Note: The * quantifier after the dot . means "match any character except a newline, zero or more times". Adding a question mark after it (*?) makes the quantifier non-greedy, so the engine matches as few characters as possible.
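As a rough usage sketch, the pattern above can be run with preg_match() against the markup from the question. The heredoc and the names $html, $m and $city are illustrative, not part of the original post.

```php
<?php
// Source markup from the question, with newlines and indentation between the tags.
$html = <<<HTML
<tr>
    <th>postal code :</th>
    <td>75012</td>
</tr>
<tr>
    <th>City :</th>
    <td>Paris</td>
</tr>
HTML;

// \s* absorbs the newline and indentation between </th> and <td>.
$city = preg_match('~<th>City :</th>\s*<td>(.*?)</td>~i', $html, $m) ? $m[1] : null;

echo $city, "\n"; // Paris
```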

However, for parsing HTML in the future, I would recommend using a tool such as DOM.

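// loadHTML() is an instance method, so create the document first.
$dom = new DOMDocument();
$dom->loadHTML('
     <tr>
      <th>postal code :</th>
      <td>75012</td>
     </tr>
     <tr>
      <th>City :</th>
      <td>Paris</td>
     </tr>
');
$xp = new DOMXPath($dom);
// Select the first sibling element after the th whose text contains "City".
$td = $xp->query('//th[contains(.,"City")]/following-sibling::*[1]');
echo $td->item(0)->nodeValue; //=> "Paris"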
hwnd
1

You just need to enable the dotall modifier (s) and put .*? between the </th> and <td> tags so that it matches the newline between them. You also need to move the * inside the capturing group; otherwise only the last character of the string Paris is captured.

<th>City :</th>.*?<td>(.*?)</td>
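A small sketch of how this pattern might be used from PHP, assuming ~ as the delimiter and the s flag for dotall. The sample string and the variable names are illustrative only.

```php
<?php
// Illustrative input with a newline and indentation between the th and td elements.
$html = "<tr>\n    <th>City :</th>\n    <td>Paris</td>\n</tr>";

// s (dotall) lets . match newlines, so .*? can cross the line break between the tags.
preg_match('~<th>City :</th>.*?<td>(.*?)</td>~s', $html, $withDotall);

// Without s, . stops at the newline and the pattern does not match.
$matchedWithoutDotall = preg_match('~<th>City :</th>.*?<td>(.*?)</td>~', $html, $ignored);

echo $withDotall[1], "\n";       // Paris
var_dump($matchedWithoutDotall); // int(0)
```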


Avinash Raj
0

This may be slower, but it is easier to apply broadly: http://php.net/manual/en/class.domelement.php