regex help with getting tag content in PHP

Question

I have this code:

function getTagContent($string, $tagname) {

    $pattern = "/<$tagname.*?>(.*)<\/$tagname>/";
    preg_match($pattern, $string, $matches);


    print_r($matches);

}

and then I call

$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$html = file_get_contents($url);
getTagContent($html,"title");

But it reports no matches, even though the page source at that URL clearly contains a title tag.

What did I do wrong?

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — dynamic, Jun 13 '11 at 13:45
Is there any code converting the `$url` into `$html`? :) Have you verified that `$html` actually contains the `<title>` (not by visiting the URL, but by outputting it from your code)? — mkilmanas, Jun 13 '11 at 13:48
[Don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454) — ssapkota, Jun 13 '11 at 13:49

score 2 · Accepted Answer · answered Jun 13 '11 at 14:06

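Try the DOM extension instead of a regex:

$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
// loadHTMLFile() returns a bool, not a document, so check it rather than storing it.
if ($doc->loadHTMLFile($url)) {
    foreach ($doc->getElementsByTagName('title') as $item) {
        echo $item->nodeValue . "\n";
    }
}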
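
To try the DOM approach without a network request, here is a minimal sketch that parses an HTML string. The helper name `getTagTexts` and the sample markup are made up for illustration; in the question, the string would come from `file_get_contents($url)`.

```php
<?php
// Hypothetical helper: returns the trimmed text of every element with the given tag name.
function getTagTexts(string $html, string $tagname): array
{
    $doc = new DOMDocument();
    // Collect parser warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    if ($loaded) {
        foreach ($doc->getElementsByTagName($tagname) as $node) {
            $texts[] = trim($node->nodeValue);
        }
    }
    return $texts;
}

// Hypothetical sample input with the title spread over several lines.
$html = "<html><head><title>\n  Wall Street Jokes, Please\n</title></head><body></body></html>";
print_r(getTagTexts($html, 'title'));
```

Because the parser works on the document tree, line breaks inside the element do not need any special handling here.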

score 0 · Answer 2 · answered Jun 13 '11 at 13:52

The 'title' tag is not on the same line as its closing tag, so your preg_match doesn't find it.

In Perl, you can add the /s switch so that the dot also matches newlines, treating the whole input as though it were on one line. PCRE, which backs preg_match, accepts the same s modifier.

But this is just one of the reasons why parsing XML and variants with regexp is a bad idea.

score 0 · Answer 3 · answered Jun 13 '11 at 13:56


Probably because the title is spread over multiple lines. You need to add the s modifier so that the dot also matches newlines.

$pattern = "/<$tagname.*?>(.*)<\/$tagname>/s";
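
A small sketch of the difference the s modifier makes. The sample string is hypothetical and only mimics a title element that spans several lines; it is not the actual page source.

```php
<?php
// Hypothetical sample input: the title element spans several lines.
$html = "<html><head><title>\n  Wall Street Jokes, Please\n</title></head></html>";

// Without the s modifier the dot stops at a newline, so nothing matches.
$withoutS = preg_match('/<title.*?>(.*)<\/title>/', $html, $m1);

// With the s modifier the dot also matches newlines.
$withS = preg_match('/<title.*?>(.*)<\/title>/s', $html, $m2);

var_dump($withoutS);     // int(0)
var_dump($withS);        // int(1)
echo trim($m2[1]), "\n"; // Wall Street Jokes, Please
```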

ripat

3,076
6
26
38

score 0 · Answer 4 · answered Jun 13 '11 at 13:58

Write your PHP function getTagContent like this:

function getTagContent($string, $tagname) {
    $pattern = '/<'.$tagname.'[^>]*>(.*?)<\/'.$tagname.'>/is';
    preg_match($pattern, $string, $matches);
    print_r($matches);
}

It is important to use the non-greedy .*? to match the text between the opening and closing tags. Equally important are the flags: s for DOTALL (the dot also matches newlines) and i for case-insensitive matching.
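
To illustrate the greedy versus non-greedy point, here is a rough sketch using a variant that returns the captured text instead of printing it. The name `getTagContentOrNull`, the `preg_quote()` call, and the sample input are additions for illustration and are not part of the answer above.

```php
<?php
// Hypothetical variant of getTagContent: returns the captured text, or null if nothing matched.
// preg_quote() is an addition; it escapes regex metacharacters in the tag name.
function getTagContentOrNull(string $string, string $tagname): ?string
{
    $tag = preg_quote($tagname, '/');
    $pattern = '/<' . $tag . '[^>]*>(.*?)<\/' . $tag . '>/is';
    return preg_match($pattern, $string, $matches) === 1 ? $matches[1] : null;
}

// Hypothetical sample input: two <b> elements, the second one upper-case and multi-line.
$html = "<p><b>first</b> and <B>second\nline</B></p>";

// Greedy (.*) runs on to the last closing tag, swallowing everything in between.
preg_match('/<b[^>]*>(.*)<\/b>/is', $html, $greedy);
echo $greedy[1], "\n";

// Non-greedy (.*?) stops at the first closing tag.
echo getTagContentOrNull($html, 'b'), "\n"; // first
```

Note that `[^>]*` directly after the tag name means a search for `b` would also accept tags such as `<body>` or `<br>`, which is one more reason the comments recommend a real parser.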
