So I have this code:

function getTagContent($string, $tagname) {

    $pattern = "/<$tagname.*?>(.*)<\/$tagname>/";
    preg_match($pattern, $string, $matches);


    print_r($matches);

}

and then I call it like this:

$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$html = file_get_contents($url);
getTagContent($html,"title");

but then it shows that there are no matches, even though the page source clearly contains a title tag.

What did I do wrong?

kamikaze_pilot

4 Answers

Try the DOM extension instead of a regex:

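$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parser warnings quietly
$doc = new DOMDocument();
$doc->loadHTMLFile($url); // returns true or false, not a document, so keep using $doc
$items = $doc->getElementsByTagName('title');
foreach ($items as $item) {
    // nodeValue is the text inside the tag, newlines included
    echo $item->nodeValue . "\n";
}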
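
For comparison with the asker's function, here is a minimal sketch of a DOM-based counterpart that works on an HTML string rather than a URL. The name getTagContentDom, its signature, and the sample markup are assumptions made for illustration; it needs PHP 7 or later with the dom extension.

```php
<?php
// Hypothetical DOM-based counterpart to getTagContent (name and signature are assumptions).
function getTagContentDom(string $html, string $tagname): array {
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $contents = [];
    foreach ($doc->getElementsByTagName($tagname) as $node) {
        // textContent is the text inside the element; trim drops surrounding newlines
        $contents[] = trim($node->textContent);
    }
    return $contents;
}

// Made-up sample markup, so the example runs without network access.
$html = "<html><head><title>\n  Wall Street Jokes\n</title></head><body><p>a</p><p>b</p></body></html>";
print_r(getTagContentDom($html, 'title'));
print_r(getTagContentDom($html, 'p'));
```

Because the parser builds a tree, a title split across several lines needs no special handling here.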
ajreal

The opening title tag is not on the same line as its closing tag, and by default . does not match a newline, so your preg_match finds nothing.

In Perl, you can add the /s switch so that . matches newlines too. PHP's preg functions are PCRE-based and accept the same s modifier (PCRE_DOTALL) after the closing delimiter.

But this is just one of the reasons why parsing XML and its variants (HTML included) with regular expressions is a bad idea.

Colin Fine

Probably because the title is spread over multiple lines. You need to add the s modifier so that the dot also matches newlines.

$pattern = "/<$tagname.*?>(.*)<\/$tagname>/s";
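
To see the difference the modifier makes, the following sketch runs the same pattern with and without s. The $html string is made-up sample input, not the actual page source.

```php
<?php
// Hypothetical sample input: the title spans several lines, as it apparently does on the real page.
$html = "<head>\n<title>\n  Wall Street Jokes\n</title>\n</head>";

// Without s, "." stops at newlines, so nothing matches.
$without = preg_match('/<title.*?>(.*)<\/title>/', $html, $m1);

// With s (PCRE_DOTALL), "." also matches newlines.
$with = preg_match('/<title.*?>(.*)<\/title>/s', $html, $m2);

var_dump($without);        // int(0)
var_dump($with);           // int(1)
echo trim($m2[1]), "\n";   // Wall Street Jokes
```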
ripat

Change your PHP function getTagContent to this:

function getTagContent($string, $tagname) {
    $pattern = '/<'.$tagname.'[^>]*>(.*?)<\/'.$tagname.'>/is';
    preg_match($pattern, $string, $matches);
    print_r($matches);
}

It is important to use the non-greedy .*? to match the text between the start and end tags, so that the match stops at the first closing tag. Equally important are the flags: s (DOTALL) makes the dot match newlines as well, and i makes the comparison case-insensitive.
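
As a rough check of those three points, the sketch below wraps the same pattern in a helper that returns the captured text instead of printing it. The helper name firstTagContent, the preg_quote call, and the sample markup are additions for illustration, not part of the function above; the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (an assumption, not the original function):
// returns the trimmed content of the first matching tag, or null if none is found.
function firstTagContent(string $html, string $tagname): ?string {
    // preg_quote escapes regex metacharacters in the tag name; "/" is the delimiter.
    $tag = preg_quote($tagname, '/');
    $pattern = '/<' . $tag . '[^>]*>(.*?)<\/' . $tag . '>/is';
    return preg_match($pattern, $html, $m) === 1 ? trim($m[1]) : null;
}

// Made-up sample: an upper-case, multi-line title and two paragraphs.
$html = "<TITLE>\nFirst\n</TITLE>\n<p>one</p><p>two</p>";

var_dump(firstTagContent($html, 'title')); // string(5) "First"  (i and s at work)
var_dump(firstTagContent($html, 'p'));     // string(3) "one"    (non-greedy stops at the first </p>)
var_dump(firstTagContent($html, 'h1'));    // NULL
```

With a greedy (.*) the second call would capture everything up to the last closing p tag instead. Note that [^>]* also lets a short name such as p match longer tags like pre, which is one more reason a DOM parser is the safer choice.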

anubhava