Why would this preg_match_all suddenly stop working?

Question

This code was working for days until it stopped working at the worst possible time. It simply pulls weather alert information from a NOAA website and displays it on my page. Can someone please tell me why this would suddenly fail?

$file = file_get_contents("http://forecast.weather.gov/showsigwx.php?warnzone=ARZ018&warncounty=ARC055");  
preg_match_all('#<div id="content">([^`]*?)<\/div>#', $file, $matches); 
$content = $matches[1];  

echo "content = ".$content."</br>" ;
echo "matches = ".$matches."</br>" ;
print_r ($matches); echo "</br>";
echo "file </br>".$file."</br></br>" ;

Now all I get is an empty array.

This is the output:

content = Array
matches = Array
Array ( [0] => Array ( ) [1] => Array ( ) )
file = the full page as requested by file_get_contents

score 7 · Answer 1 · edited May 23 '17 at 12:09

Your regexp is trying to match the literal string <div id="content">, followed by some (as few as possible) chars that are not backticks (`), followed by the literal string </div>.

However, in the current set of NOAA warnings and advisories, there is a backtick between <div id="content"> and </div>:

A SLIGHT RISK FOR SEVERE THUNDERSTORMS IS IN EFFECT FOR NORTHEAST
MISSISSIPPI SOUTH OF A CALHOUN CITY TO FULTON MISSISSIPPI LINE
FROM LATE THIS AFTERNOON THROUGH THIS EVENING. DAMAGING WINDS
WILL BE THE MAIN THREAT...HOWEVER AN ISOLATED TORNADO CAN`T BE
RULED OUT.

That's why your regexp doesn't match.

The simplest "fix" would be to replace the regexp with, say:

'#<div id="content">(.*?)<\/div>#s'

With the s modifier, . matches any character, including newlines and backticks.
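To make the difference concrete, here is a minimal sketch that runs both patterns against a short string. The `$sample` text is a made-up stand-in for the NOAA page, not its real markup, and the variable names are only illustrative.

```php
<?php
// Hypothetical sample standing in for the NOAA page; not the real markup.
$sample = "<div id=\"content\">\nAN ISOLATED TORNADO CAN`T BE\nRULED OUT.\n</div>";

// Original pattern: [^`] cannot cross the backtick, so nothing matches.
$oldCount = preg_match_all('#<div id="content">([^`]*?)</div>#', $sample, $old);

// Suggested pattern: with the s modifier, . also matches newlines.
$newCount = preg_match_all('#<div id="content">(.*?)</div>#s', $sample, $new);

// $new[1] is an array of captures, so index into it before echoing.
$firstCapture = $newCount > 0 ? trim($new[1][0]) : '';

echo "old pattern matches: $oldCount\n";
echo "new pattern matches: $newCount\n";
echo $firstCapture, "\n";
```

Note that `$matches[1]` is itself an array, which is why echoing it directly prints only the word Array; picking out `$matches[1][0]` gives the captured text.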

However, what you really should do is use a proper HTML parser to extract the text, instead of trying to parse HTML with regexps.

Edit: Here's a quick example (untested!) of how you could do this with DOMDocument:

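// Page URL taken from the question
$url = "http://forecast.weather.gov/showsigwx.php?warnzone=ARZ018&warncounty=ARC055";
$html = file_get_contents( $url );
$doc = new DOMDocument();
$doc->loadHTML( $html );
// textContent holds the element's text with the tags stripped
$content = $doc->getElementById( 'content' )->textContent;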

or even just:

$doc = new DOMDocument();
$doc->loadHTMLFile( $url );
$content = $doc->getElementById( 'content' )->textContent;
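Real-world pages are often not clean HTML, and libxml may emit parse warnings (for example about a bare & character) while still building a usable document. The sketch below is one possible way to keep those warnings off the page and to guard against a missing element. The inline `$html` string is a hypothetical stand-in for the fetched page, so the example runs without network access.

```php
<?php
// Hypothetical stand-in for the fetched page; the bare "&" is the kind of
// markup that can make libxml report a parse warning.
$html = '<html><body><div id="content">WIND & HAIL CAN`T BE RULED OUT.</div></body></html>';

// Collect libxml parse warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// getElementById() returns null when no element has that id.
$node = $doc->getElementById('content');
$content = $node !== null ? trim($node->textContent) : '';

echo $content, "\n";
```

For the live page, replacing the inline string with `file_get_contents( $url )` should work the same way, assuming the page still contains an element with the id content.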

answered Dec 25 '12 at 17:29

Ilmari Karonen


WOW. Thank you. I never would have caught that. I am really new to PHP and am trying to learn as I go. What would you recommend that I use? – user1928523 Dec 25 '12 at 17:40
[DOMDocument::loadHTML()](http://docs.php.net/manual/en/domdocument.loadhtml.php) works pretty well and is built into PHP. – Ilmari Karonen Dec 25 '12 at 17:43
I really appreciate the input and I have been trying to get my head around this, unsuccessfully, for the past hour. I usually prefer to learn what I am doing but I really need to get this back up quickly. How would I implement the DOM method? – user1928523 Dec 25 '12 at 18:57
Only one error...Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: htmlParseEntityRef: no name in http://forecast.weather.gov/showsigwx.php?warnzone=ARZ018&warncounty=ARC055, line: 87 in C:\xampp\htdocs\greenecountyskywarn.org\alert.php on line 3 – user1928523 Dec 25 '12 at 19:42
@user1928523: No problem, glad to see it's working. Don't forget to click the check mark on the left to accept this answer! :) – Ilmari Karonen Dec 25 '12 at 20:56
