I wrote three simple functions to scrape the title, description, and keywords of a simple HTML page. This is the first function, which scrapes the title:

function getPageTitle ($url)
{
    $content = $url;
    if (eregi("<title>(.*)</title>", $content, $array)) {
        $title = $array[1];
        return $title;
    }
}

It works fine. These are the two functions that scrape the description and keywords, and they do not work:

function getPageKeywords($url)
{
    $content = $url; 
    if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+keywords[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) { 
        $keywords = $array[1];  
        return $keywords; 
    }  
}
function getPageDesc($url)
{
    $content = $url; 
    if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+description[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) { 
        $desc = $array[1];  
        return $desc; 
    }  
}

I know there may be something wrong with the preg_match line, but I really don't know what. I have tried many things and it still doesn't work.

Marco
  • Just a note: `eregi` is deprecated. http://php.net/manual/en/function.eregi.php – Will Jun 15 '12 at 03:33
  • Using regex to parse HTML falls over at anything more complex than a simple tag pair; once you start parsing tag attributes, you need to switch to PHP DOM: http://php.net/manual/en/book.dom.php The problem there is that the name, description and content attributes have to be in the order you are matching against. – Sp4cecat Jun 15 '12 at 03:37
  • Third important point: just because it's on a web page does not mean you have the right to use the data any way you like (without permission). –  Jun 15 '12 at 03:40
  • Have you tried [Simple HTML DOM parser](http://simplehtmldom.sourceforge.net/manual.htm)? It's like jQuery DOM parsing. – tradyblix Jun 15 '12 at 03:40
  • [Tony the Pony](http://stackoverflow.com/a/1732454/118068) is coming to get you... and he HUNGERS. – Marc B Jun 15 '12 at 03:41
  • @Sp4cecat I tried it, but I think it only parses the structure between the tags. – Marco Jun 15 '12 at 03:51
  • @William Van Rensselaer I know, but I think it's my only choice. – Marco Jun 15 '12 at 03:53
  • @Dagon I'm using this app for my own pages, not to scrape information from other pages. – Marco Jun 15 '12 at 03:53
  • @Marco refresh your page I answered 1 min ago...;p – Lawrence Cherone Jun 15 '12 at 03:54
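
The attribute-order point raised in the comments can be checked directly. The sketch below runs the keywords pattern from the question against two hypothetical sample tags (the tag strings and variable names are assumptions made for illustration, not taken from any real page). It suggests the pattern only matches when `name` comes before `content`:

```php
<?php
// Pattern copied verbatim from getPageKeywords() in the question.
$pattern = '/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+keywords[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i';

// Hypothetical sample tags: same attributes, different order.
$nameFirst    = '<meta name="keywords" content="php, regex">';
$contentFirst = '<meta content="php, regex" name="keywords">';

// preg_match() returns 1 on a match and 0 when nothing matches.
$nameFirstMatched    = preg_match($pattern, $nameFirst, $m1);
$contentFirstMatched = preg_match($pattern, $contentFirst, $m2);

echo "name first:    ", $nameFirstMatched, "\n";    // 1, captured group is "php, regex"
echo "content first: ", $contentFirstMatched, "\n"; // 0, the pattern expects name before content
```

Both tags are equally valid HTML, so a page that writes `content` first would make the functions return nothing even though the keywords are present.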

2 Answers

Why not use get_meta_tags? It is described in the PHP documentation.

<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');

// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author'];       // name
echo $tags['keywords'];     // php documentation
echo $tags['description'];  // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>

NOTE: The parameter can be either a URL or the path to a local file, passed as a string. It is a filename, so raw HTML content cannot be passed directly.
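
For pages you already have on disk, a minimal sketch of the local-file case could look like the following. The sample HTML, the temporary file, and the variable names are assumptions made for illustration; only `get_meta_tags` itself comes from the answer above.

```php
<?php
// Hypothetical sample page written to a temporary file for the demo.
$html = <<<HTML
<html><head>
<title>Example</title>
<meta name="keywords" content="php, scraping">
<meta name="description" content="A sample page">
<meta name="geo.position" content="49.33;-86.59">
</head><body></body></html>
HTML;

$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $html);

// get_meta_tags() takes a filename or URL and returns an array
// keyed by the meta name ("." in a name becomes "_" in the key).
$tags = get_meta_tags($path);
unlink($path);

// A tag may be absent, so fall back to an empty string.
$keywords    = $tags['keywords'] ?? '';
$description = $tags['description'] ?? '';

echo $keywords, "\n";
echo $description, "\n";
echo $tags['geo_position'] ?? '', "\n";
```

The `??` fallback avoids an undefined-index notice on pages that omit a tag.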

Mike Mackintosh

It's better to use PHP's native DOMDocument to parse HTML than regex. You can also use , though these days a lot of sites no longer include the keywords and description tags, so you can't rely on them always being there. Here is how you can do it with DOMDocument:

<?php 
$source = file_get_contents('http://php.net');

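$dom = new DOMDocument("1.0","UTF-8");
$dom->preserveWhiteSpace = false; // set before loading so it takes effect
@$dom->loadHTML($source);         // @ hides warnings caused by malformed HTML

//Get Title (guard against pages that have no <title> element)
$titleNode = $dom->getElementsByTagName('title')->item(0);
$title = $titleNode ? $titleNode->nodeValue : '';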

$description = '';
$keywords = '';
foreach($dom->getElementsByTagName('meta') as $metas) {
    if($metas->getAttribute('name') =='description'){ $description = $metas->getAttribute('content'); }
    if($metas->getAttribute('name') =='keywords'){    $keywords = $metas->getAttribute('content');    }
}

print_r($title);
print_r($description);
print_r($keywords);
?> 
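
Since the functions in the question take the page content as a string, the same DOMDocument idea could be wrapped as below. This is only a sketch: `extractPageMeta` is a hypothetical helper name, the sample HTML is made up, and the case-insensitive comparison of the `name` attribute is an added assumption about what real pages contain.

```php
<?php
// Hypothetical helper: returns title, description and keywords from an HTML string.
function extractPageMeta(string $html): array
{
    $result = ['title' => '', 'description' => '', 'keywords' => ''];
    if ($html === '') {
        return $result; // loadHTML() rejects an empty string
    }

    // Collect libxml parse warnings internally instead of using @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $result;
    }

    $titleNode = $dom->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $result['title'] = trim($titleNode->textContent);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        // Attribute order does not matter here, unlike with the regex.
        $name = strtolower(trim($meta->getAttribute('name')));
        if ($name === 'description' || $name === 'keywords') {
            $result[$name] = $meta->getAttribute('content');
        }
    }
    return $result;
}

// Made-up sample: content before name, and a capitalised attribute value.
$html = '<html><head><title>Demo</title>'
      . '<meta content="php, dom" name="Keywords">'
      . '<meta name="description" content="A demo page">'
      . '</head><body></body></html>';

$meta = extractPageMeta($html);
print_r($meta);
```

Missing tags come back as empty strings, so callers do not need to check for their presence separately.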
Lawrence Cherone