1

I'm trying to obtain the keywords from an HTML page that I'm scraping with PHP.

So, if the keywords tag looks like this:

<meta name="Keywords" content="MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary">

I want to get this back:

MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary

I've constructed a regex, but it's not doing the trick.

(?i)^(<meta name=\"keywords\" content=\"(.*)\">)

Any ideas?

TWLATL

7 Answers

3

I would use an HTML/XML parser like DOMDocument and XPath to retrieve the nodes from the DOM:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$keywords = $xpath->query('//meta[translate(normalize-space(@name), "KEYWORDS", "keywords")="keywords"]/@content');
foreach ($keywords as $keyword) {
    echo $keyword->value;
}

The translate function is necessary because PHP's DOMXPath implements XPath 1.0, which has no lower-case function.

Alternatively, do the filtering in PHP:

$metas = $xpath->query('//meta');
foreach ($metas as $meta) {
    if ($meta->hasAttribute("name") && trim(strtolower($meta->getAttribute("name")))=='keywords' && $meta->hasAttribute("content")) {
        // getAttribute() already returns a string, so there is no ->value to read
        echo $meta->getAttribute("content");
    }
}
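
As a rough end-to-end sketch of the XPath variant, the following wraps it in a function and runs it on an inline HTML string. The function name extractMetaKeywords and the sample markup are made up for illustration. The libxml_use_internal_errors() call is an assumption that scraped pages are often malformed and would otherwise emit parse warnings.

```php
<?php
// Hypothetical helper, not part of the original answer.
function extractMetaKeywords(string $html): ?string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath 1.0 has no lower-case(), so translate() folds the name to lower case.
    $attrs = $xpath->query(
        '//meta[translate(normalize-space(@name), "KEYWORDS", "keywords")="keywords"]/@content'
    );

    // Return the first match, or null if the page has no keywords tag.
    return $attrs->length > 0 ? $attrs->item(0)->value : null;
}

// Made-up sample page for illustration.
$sample = '<html><head><meta name="Keywords" content="Mac Software, Mac Games"></head><body></body></html>';
$keywords = extractMetaKeywords($sample);
echo $keywords, "\n";
```

Returning null when nothing matches lets the caller distinguish a missing tag from an empty content attribute.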
Gumbo
2

Stop using regex. It's slow, resource intensive, and not very nimble.

If you're programming in PHP, check out http://simplehtmldom.sourceforge.net/. Simple HTML DOM is powerful enough to get you everything you need in a very simple object-oriented way.

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';

Another example -

// Example
$html = str_get_html("<div>foo <b>bar</b></div>"); 
$e = $html->find("div", 0);

echo $e->tag; // Returns: " div"
echo $e->outertext; // Returns: " <div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: " foo <b>bar</b>"
echo $e->plaintext; // Returns: " foo bar"
Wes
2

Use the built-in function get_meta_tags().

Tutorial
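
A minimal sketch of how get_meta_tags() might be used for this. The sample markup and the temporary file are assumptions made so the example runs offline; in practice you would pass the page's URL or a saved copy of it.

```php
<?php
// Sample page written to a temporary file so the sketch runs offline;
// get_meta_tags() also accepts a URL if allow_url_fopen is enabled.
$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents(
    $path,
    '<html><head><meta name="Keywords" content="Mac Software, Mac Games"></head><body></body></html>'
);

// Keys of the returned array are the lower-cased name attributes.
$tags = get_meta_tags($path);
unlink($path);

// Fall back to null if the page has no keywords tag.
$keywords = $tags['keywords'] ?? null;
echo $keywords, "\n";
```

Because the keys are lower-cased, name="Keywords" and name="keywords" both end up under $tags['keywords'].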

Cups
  • When fetching pages to work on, I assume that getting the keywords is only one of several operations, so I always do it in two steps: 1) fetch the file and store it locally, 2) do the post-fetch ripping. I find that more reliable, since so much can go wrong when fetching from the web. But if you're only after the keywords, why bother storing the file? Just use get_meta_tags(). – Cups Nov 15 '09 at 18:26
  • Was not aware of the get_meta_tags function. Awesome - thanks! – TWLATL Nov 16 '09 at 14:38
1

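(.*) is greedy by default, so it matches everything up to the LAST "> it can reach (on the same line, since . does not match newlines by default), which is obviously not what you want. You need to use a lazy quantifier or a negated character class: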

content=\"(.*?)\"

or

content=\"([^\"]*)\"
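
To illustrate the difference, here is a small sketch that runs all three variants against the same input. The input string is made up: two meta tags on a single line, as minified pages often have.

```php
<?php
// Made-up input: two meta tags on one line.
$html = '<meta name="keywords" content="Mac Software, Mac Games"><meta name="author" content="Someone">';

// Greedy: (.*) runs to the last "> on the line, swallowing the second tag.
preg_match('/content="(.*)">/', $html, $greedy);

// Lazy: (.*?) stops at the first "> it meets.
preg_match('/content="(.*?)">/', $html, $lazy);

// Negated class: [^"]* cannot cross a double quote at all.
preg_match('/content="([^"]*)"/', $html, $negated);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
echo $negated[1], "\n";
```

The lazy and negated-class versions both capture only the keywords here; the negated class is the stricter of the two, since it can never run past the closing quote of the attribute.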
yu_sha
  • That won't work completely: since the pattern is anchored with `^`, the meta element would have to be at the very beginning of the HTML, which should never be the case. – Joost Nov 15 '09 at 16:01
1

Stop trying to parse HTML with regular expressions.

RegEx match open tags except XHTML self-contained tags

Community
Ether
0

(?i)<meta\s+name="keywords"\s+content="(.*?)">

In PHP, that would look something like:

preg_match('~<meta\s+name="keywords"\s+content="(.*?)">~i', $html, $matches);
Joost
0

This is a simple regex that matches the first meta keywords tag. It only allows letters, numbers, legal URL characters, HTML entities and spaces to appear inside the content attribute. Note that it is case-sensitive and expects the attributes in exactly this order.

$matches = array();
preg_match("/<meta name=\"Keywords\" content=\"([\w\d;,\.: %&#\/\\\\]*)\"/", $html, $matches);
echo $matches[1]; 
gnud