Capturing text within HTML tag using PHP and preg_match

Question

I have hit a roadblock with a script that checks availability on a certain website. I need the text inside an HTML tag and I am unsure how to approach it.

The code I have tested so far ended up like this:

<?php
ini_set("allow_url_fopen", 1);
$homepage2 = file_get_contents('https://www.someurlwithavailability.com');
//URL has the following HTML tag: <div id="Availability">
                            Availability: Special Offer, ships within 10 - 15 business days                         </div>"
preg_match("/<div id="Availability">(.*?)</div>/si", $homepage2, $avail);
print_r($avail);
echo '<br>', '~Availability is~', '<br>', $avail, '<br>';
$stringavail=implode(" ",$avail);
echo $stringavail;
?>

I get various errors depending on what I put in place of *** in preg_match(***, $homepage2, $avail); and I am unsure what syntax I need to retrieve the text.

My code above gives me this:

Parse error: syntax error, unexpected 'Availability' (T_STRING) in /u/o/placeiamrunningthecodefrom.php on line 6

The URL that is requested comes back with a full HTML page that is quite large. This HTML tag is unique and does not repeat.

Anyone able to help me out?

Is there any option to use PHP DOM? I prefer to use PHP DOM to parse an HTML string if the element id / class is not dynamic — rheeantz, Jun 02 '17 at 21:26
I read about DOM but I am confused about how it will modify the HTML I have to work with, so I have tried to stay away from it. — Chris, Jun 03 '17 at 02:30
Through a combination of these answers I now have a solution. Thank you! — Chris, Jun 03 '17 at 02:44

Barmar · Answer 1 · 2017-06-02T21:39:45.793

The problem is that you have double quotes inside your double-quoted string, and didn't escape them:

preg_match("/<div id="Availability">(.*?)</div>/si", $homepage2, $avail);
                     ^            ^

If you used a decent IDE it would have alerted you to this as you were typing.

Simply change the delimiting quotes to single quotes.

Also, since your regexp delimiter / appears in the regular expression, you either need to escape the character where it appears in the regexp, or use a delimiter that isn't in the expression.

preg_match('#<div id="Availability">(.*?)</div>#si', $homepage2, $avail);
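As a minimal sketch of how the corrected call might be used end to end: the sample markup below is an assumption standing in for the fetched page, and the $text variable is introduced only for illustration. The captured text is in $avail[1], and trim() removes the surrounding whitespace.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for this sketch).
$homepage2 = '<html><body><div id="Availability">
                            Availability: Special Offer, ships within 10 - 15 business days                         </div></body></html>';

// Single-quoted pattern with # as the delimiter, so neither " nor / needs escaping.
if (preg_match('#<div id="Availability">(.*?)</div>#si', $homepage2, $avail)) {
    // $avail[0] is the whole match; $avail[1] is the first capture group.
    $text = trim($avail[1]);
    echo $text, "\n";
} else {
    echo "Availability div not found\n";
}
```

Echoing $avail itself would not work, because preg_match fills it with an array; indexing into it gives the string.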

However, using regular expressions to parse HTML is generally a bad idea. You should use a DOM parser library like the DOMDocument class.

edited Jun 02 '17 at 21:39

answered Jun 02 '17 at 21:31

Barmar

That is not the only problem. / is used as the delimiter but also appears in the closing div tag. – Andreas Jun 02 '17 at 21:33
Good catch, I've updated the answer to deal with this. – Barmar Jun 02 '17 at 21:40
I've tried a few combinations of delimiters and I kept running into the same issues of syntax problems. I will try your suggestion with the pound sign – Chris Jun 03 '17 at 02:32

score 0 · Answer 2 · answered Jun 02 '17 at 21:32

Although this can work just fine with regex, it's not recommended, nor is it easier.

I'd suggest giving DOMDocument::getElementById a go. It even has an example right on the manual page:

$doc = new DOMDocument;

// We need to validate our document before referring to the id
$doc->validateOnParse = true;
$doc->load('book.xml');

echo "The element whose id is 'php-basics' is: " . $doc->getElementById('php-basics')->tagName . "\n";

To get the element's text content instead of its tag name, use ->textContent, which is inherited from DOMNode.
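Applied to the page in the question, this might look like the sketch below. The sample markup is an assumption standing in for the fetched page, and the $html, $node and $text names are illustrative only. It uses loadHTML() rather than load(), since the input is an HTML string and not an XML file, and it only reads from the document without modifying it.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for this sketch).
$html = '<html><body><div id="Availability">
    Availability: Special Offer, ships within 10 - 15 business days   </div></body></html>';

$doc = new DOMDocument();

// Real-world pages are rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// The HTML parser treats the id attribute as an ID, so getElementById() can find it.
$node = $doc->getElementById('Availability');
$text = $node !== null ? trim($node->textContent) : null;

echo $text ?? 'Availability div not found', "\n";
```

Checking for null before reading textContent avoids an error if the element is missing from the page.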

score 0 · Answer 3 · answered Jun 02 '17 at 21:38

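Try using single quotes around the pattern, and make sure you escape the special regex characters, including the / delimiter where it appears inside the pattern. Also, (.*?) accepts any characters, including other tags, up to the next </div>, so it helps to be more specific about what may appear inside the element.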

'/<div id="Availability">([^<]*)<\/div>/si'

instead of

"/<div id="Availability">(.*?)</div>/si"

Of course, this could still be unreliable if there is HTML inside that <div>.

But, this should get you closer.
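To illustrate that caveat, here is a small sketch with two made-up inputs (the $plain and $nested strings are assumptions, not taken from the real page). Because [^<]* refuses to cross any < character, the pattern matches a div that holds only text and fails on one that contains a nested tag.

```php
<?php
// Single-quoted pattern; the / inside the closing tag is escaped as \/.
$pattern = '/<div id="Availability">([^<]*)<\/div>/si';

// Hypothetical inputs: one with plain text, one with nested markup.
$plain  = '<div id="Availability"> In stock </div>';
$nested = '<div id="Availability"> In <b>stock</b> </div>';

// preg_match() returns 1 on a match and 0 when nothing matches.
$foundPlain  = preg_match($pattern, $plain, $m1);
$foundNested = preg_match($pattern, $nested, $m2);

var_dump($foundPlain);   // int(1)
var_dump(trim($m1[1]));  // string(8) "In stock"
var_dump($foundNested);  // int(0)
```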

Also, try an online regex tool. I like this one. https://regex101.com/

answered Jun 02 '17 at 21:38

Dan Hawkins

I assumed the spaces between the text I was going after might have an issue as well. I will try this and see what happens. – Chris Jun 03 '17 at 02:33
Changing to ([^<]*) in the code seemed to have caught the whole string between the tags. Having it like (.*?) left the output blank. Probably picked up a white space? Not sure why. – Chris Jun 03 '17 at 02:49
