Regex Html Tricky

Question

I have this regex, but it's not working, perhaps because of the newlines. My goal is to extract each passenger's name and phone number.

Here is a snippet of the data I have; the full input contains 100 blocks like this one:

<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Ms Wendy Walker-hunter
    </p>

    <p>
        <b>Mobile Number:</b><br />
        161525961468
    </p>

I'm currently just trying to get the passenger's name first:

$re = '/(?<=Name)(.*)(?=Mobile)/s';
preg_match($re, $str, $matches);

// Print the entire match result
print_r($matches);

Any kind of help I can get on this is greatly appreciated!

You should use a DOM parser to extract this data. You can target each `.booking-section` element, and take the passenger name from the first `<p>` tag and the mobile number from the second. Then you can strip out the `<b>` and its contents, and the `<br />`. Don't use regex for this. — scrowler, Feb 20 '17 at 23:14

miken32 · Answer 1 · 2017-09-13T18:53:12.860

Never parse HTML with a regular expression. Here's how you should be doing this sort of thing:

$html = '<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Ms Wendy Walker-hunter
    </p>

    <p>
        <b>Mobile Number:</b><br />
        161525961468
    </p>
</div>
<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Mr John Walker
    </p>

    <p>
        <b>Mobile Number:</b><br />
        16153682486
    </p>
</div>
';
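// Buffer libxml parse errors internally instead of emitting PHP warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Non-blank text nodes directly inside the first <p> of each booking section
$results = $xpath->query("//div[@class='booking-section']/p[1]/text()[normalize-space()]");
foreach ($results as $node) {
    echo trim($node->textContent) . "\n";
}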

This uses an XPath query to get the nodes you're looking for:

//div[@class='booking-section']/p[1]/text()[normalize-space()]

This selects the non-blank text nodes that are direct children of the first <p> element inside each <div> whose class attribute is exactly "booking-section". To get the mobile numbers instead, change p[1] to p[2].
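To get the name and the mobile number together, one option is to run the same kind of query once per booking section, using the section as the context node. This is a minimal sketch that assumes the markup matches the sample above; `extractPassengers` is a made-up helper name, not part of the original answer:

```php
<?php
// Hypothetical helper: returns one ['name' => ..., 'phone' => ...] row
// per <div class="booking-section"> found in $html.
function extractPassengers(string $html): array
{
    libxml_use_internal_errors(true); // keep markup warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $rows = [];
    foreach ($xpath->query("//div[@class='booking-section']") as $section) {
        // The paths are relative to $section. normalize-space() takes the
        // first non-blank text node of the <p> and trims/collapses whitespace.
        $rows[] = [
            'name'  => $xpath->evaluate('normalize-space(p[1]/text()[normalize-space()])', $section),
            'phone' => $xpath->evaluate('normalize-space(p[2]/text()[normalize-space()])', $section),
        ];
    }
    return $rows;
}

$html = '<div class="booking-section">
    <h4>Passenger Details</h4>
    <p>
        <b>Passenger Name:</b><br />
        Ms Wendy Walker-hunter
    </p>
    <p>
        <b>Mobile Number:</b><br />
        161525961468
    </p>
</div>';

foreach (extractPassengers($html) as $row) {
    echo $row['name'], ' / ', $row['phone'], "\n";
}
```

Keeping both values in one row per section avoids having to line up two separate result lists afterwards.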

According to the documentation:

this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.

I've enabled libxml's internal error handling for this example, to suppress any warnings about the HTML, though of course you should not be outputting warnings to users anyway.
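If you would rather see what the parser complained about than discard it, the buffered errors can be read back with libxml_get_errors(). This is a rough sketch; `loadHtmlCollectingErrors` is a hypothetical helper name, and which messages you get depends on the markup and on the libxml version:

```php
<?php
// Hypothetical helper: parse $html without emitting warnings and return
// the document together with whatever libxml recorded while parsing.
function loadHtmlCollectingErrors(string $html): array
{
    // Buffer errors internally; remember the previous setting to restore it.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) { // each item is a LibXMLError
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();                  // empty the buffer
    libxml_use_internal_errors($previous);  // restore the earlier behaviour

    return [$dom, $messages];
}

[$dom, $messages] = loadHtmlCollectingErrors('<p>unclosed <b>bold</p>');
foreach ($messages as $message) {
    error_log($message); // log for developers rather than showing users
}
```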

thanks for this but i'm getting nasty errors `Warning: DOMDocument::loadHTML(): Misplaced DOCTYPE declaration in Entity, line: ` — thevoipman, Feb 21 '17 at 00:41
The code as provided works fine for me, are you trying it using the HTML that's above, or the full HTML document? — miken32, Feb 21 '17 at 00:42

Vasil Anagnostos · Accepted Answer · 2017-02-21T00:45:07.823

This should work as long as the snippets are always formatted like the example, because it relies on the newlines:

$t = '
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468
  </p>
</div>';

// [^\r\n]+ skips the rest of the label line; the group captures the whole next line
preg_match('/Passenger Name:[^\r\n]+\r?\n([^\r\n]+)\r?\n/', $t, $name);

preg_match('/Mobile Number:[^\r\n]+\r?\n([^\r\n]+)\r?\n/', $t, $phone);

echo trim($name[1]), ' / ', trim($phone[1]);

Output is: Ms Wendy Walker-hunter / 161525961468

Same with preg_match_all:

$t = '
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468
  </p>
</div>
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter 2
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468 2
  </p>
</div>
<div class="booking-section">
  <h4>Passenger Details</h4>
  <p>
    <b>Passenger Name:</b><br />
    Ms Wendy Walker-hunter 3
  </p>
  <p>
    <b>Mobile Number:</b><br />
    161525961468 3
  </p>
</div>';

preg_match_all('/Passenger Name:[^\r\n]+\r?\n([^\r\n]+)\r?\n/', $t, $name);

preg_match_all('/Mobile Number:[^\r\n]+\r?\n([^\r\n]+)\r?\n/', $t, $phone);

echo '<pre>';
print_r($name);
print_r($phone);
die;

Output is something like this (the [0] arrays holding the full matches are omitted):

Array
(
    [1] => Array
        (
            [0] =>     Ms Wendy Walker-hunter
            [1] =>     Ms Wendy Walker-hunter 2
            [2] =>     Ms Wendy Walker-hunter 3
        )

)
Array
(
    [1] => Array
        (
            [0] =>     161525961468
            [1] =>     161525961468 2
            [2] =>     161525961468 3
        )

)
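If the whitespace is not guaranteed to be consistent, a somewhat more forgiving variation is to anchor on the tags around the value rather than on the line breaks. This is only a sketch: `extractField` is a hypothetical helper, the pattern assumes each label is always followed by `</b>` and a `<br>`, and it is still a regex over HTML, so it will break if the markup changes shape.

```php
<?php
// Hypothetical helper: returns every value that follows "<label>:</b><br>",
// tolerating any whitespace (spaces, tabs, LF, CRLF, or none) around it.
function extractField(string $html, string $label): array
{
    $pattern = '/' . preg_quote($label, '/') . ':\s*<\/b>\s*<br\s*\/?>\s*([^<]*?)\s*</';
    preg_match_all($pattern, $html, $matches);
    return $matches[1]; // captured values, already trimmed by the pattern
}

// Sample input with deliberately inconsistent whitespace.
$t = "<p><b>Passenger Name:</b><br />\r\n\t Ms Wendy Walker-hunter </p>"
   . "<p><b>Mobile Number:</b><br/>161525961468</p>";

$names  = extractField($t, 'Passenger Name');
$phones = extractField($t, 'Mobile Number');

// array_map(null, ...) pairs the two lists by position: [[name, phone], ...]
$rows = array_map(null, $names, $phones);
print_r($rows);
```

Pairing by position only makes sense when every booking has both fields; if one can be missing, the two lists drift out of step, which is one more argument for a per-section DOM approach.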

@thevoipman Or what if the whitespace doesn't match perfectly? That's one more reason why you shouldn't parse HTML with regular expressions. — miken32, Feb 21 '17 at 00:39
If it is not in a loop as you mentioned, you can use preg_match_all. — Vasil Anagnostos, Feb 21 '17 at 00:41
