RegEx to get the keywords from HTML

Question

I'm trying to obtain the keywords from an HTML page that I'm scraping with PHP.

So, if the keywords tag looks like this:

<meta name="Keywords" content="MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary">

I want to get this back:

MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary

I've constructed a regex, but it's not doing the trick.

(?i)^(<meta name=\"keywords\" content=\"(.*)\">)

Any ideas?

score 3 · Answer 1 · answered Nov 15 '09 at 16:16

I would use an HTML/XML parser like DOMDocument and XPath to retrieve the nodes from the DOM:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$keywords = $xpath->query('//meta[translate(normalize-space(@name), "KEYWORDS", "keywords")="keywords"]/@content');
foreach ($keywords as $keyword) {
    echo $keyword->value;
}

The translate function is necessary because PHP's XPath implementation supports only XPath 1.0, which has no lower-case function.

Alternatively, do the filtering in PHP:

$metas = $xpath->query('//meta');
foreach ($metas as $meta) {
    // getAttribute() returns a plain string, so it is echoed directly (no ->value).
    if ($meta->hasAttribute("name") && trim(strtolower($meta->getAttribute("name")))=='keywords' && $meta->hasAttribute("content")) {
        echo $meta->getAttribute("content");
    }
}
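
Putting the XPath approach together, here is a minimal, self-contained sketch. The sample `$html` string is made up for illustration, and splitting the result into an array is an extra step the answer does not include. It also assumes you want to silence the warnings that loadHTML() emits for real-world, malformed markup, and it maps the whole alphabet in translate() rather than only the letters of "keywords".

```php
<?php
// Hypothetical sample input; in practice $html would be the scraped page.
$html = '<html><head><title>t</title>'
      . '<meta name="Keywords" content="MacUpdate, Mac Software, Apple">'
      . '</head><body><p>Hello</body></html>';

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// XPath 1.0 has no lower-case(), so translate() maps A-Z to a-z.
$query = '//meta[translate(normalize-space(@name), '
       . '"ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")'
       . '="keywords"]/@content';

$keywords = [];
foreach ($xpath->query($query) as $attr) {
    // Split the comma-separated list into trimmed individual keywords.
    foreach (explode(',', $attr->value) as $keyword) {
        $keywords[] = trim($keyword);
    }
}
print_r($keywords);
```

If the page has no keywords meta tag, the query returns an empty node list and `$keywords` stays an empty array.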

@Svante: But `get_meta_tags` expects a filename and not the HTML source. — Gumbo, Nov 15 '09 at 16:48

score 2 · Answer 2 · answered Nov 06 '12 at 20:17

Stop using regex. It's slow, resource-intensive, and not very nimble.

If you're programming in PHP, check out http://simplehtmldom.sourceforge.net/. Simple HTML DOM is powerful enough to get you everything you need in a very simple object-oriented way.

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';

Another example -

// Example
$html = str_get_html("<div>foo <b>bar</b></div>"); 
$e = $html->find("div", 0);

echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"

score 2 · Accepted Answer · answered Nov 15 '09 at 16:14

Use the function get_meta_tags();

Tutorial

answered Nov 15 '09 at 16:14 by Cups

When fetching stuff to work on, I'm guessing that getting the keywords is only one of several operations, so I always do it in two bites: 1) get the file and store it locally, 2) do my post-fetch ripping. I just find that more reliable, as so much can go wrong when fetching from the web. But if you're only after the keywords, why bother getting the file? Just use get_meta_tags(). – Cups Nov 15 '09 at 18:26
Was not aware of the get_meta_tags function. Awesome - thanks! – TWLATL Nov 16 '09 at 14:38
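
Since get_meta_tags() reads from a file or URL rather than from a string, the "store it locally first" workflow described in the comment above might be sketched like this. The sample `$html` and the use of a temporary file are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical sample input standing in for a page that was already fetched.
$html = '<html><head>'
      . '<meta name="Keywords" content="MacUpdate, Mac Software, Apple">'
      . '</head><body></body></html>';

// get_meta_tags() expects a filename or URL, so store the source locally first.
$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $html);

// The keys of the returned array are the lower-cased name attributes.
$tags = get_meta_tags($path);
unlink($path);

// Fall back to an empty string when the page has no keywords tag.
$keywords = isset($tags['keywords']) ? $tags['keywords'] : '';
echo $keywords, "\n";
```

If you have not downloaded the page yet, passing the URL straight to get_meta_tags() avoids the temporary file.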

score 1 · Answer 4 · answered Nov 15 '09 at 15:57

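(.*) matches as much as it can, so the capture runs up to the LAST "> it can reach (the last one on the line, or in the whole document if the s modifier is set), which is obviously not what you want. Regex quantifiers are greedy by default. You need to use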

content=\"(.*?)\"

or

content=\"([^\"]*)\"

answered Nov 15 '09 at 15:57 by yu_sha

That won't work completely, since he uses the `^`, so the meta-element needs to be at the beginning of the html which should never be the case. – Joost Nov 15 '09 at 16:01
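
To make the difference concrete, here is a small sketch that runs the greedy, lazy, and negated-class variants against the same input. The one-line `$html` sample is made up for illustration, and the leading `^` is dropped as suggested in the comment above.

```php
<?php
// Hypothetical input: two meta tags on the same line.
$html = '<head><meta name="keywords" content="a, b"><meta name="author" content="x"></head>';

// Greedy: (.*) runs on to the last "> on the line.
preg_match('~<meta name="keywords" content="(.*)">~i', $html, $greedy);

// Lazy: (.*?) stops at the first "> it meets.
preg_match('~<meta name="keywords" content="(.*?)">~i', $html, $lazy);

// Negated class: [^"]* can never run past the closing quote.
preg_match('~<meta name="keywords" content="([^"]*)"~i', $html, $negated);

echo $greedy[1], "\n";  // a, b"><meta name="author" content="x
echo $lazy[1], "\n";    // a, b
echo $negated[1], "\n"; // a, b
```

The negated class is usually the safer of the two fixes, because it cannot cross the closing quote no matter what follows it.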

score 1 · Answer 5 · edited May 23 '17 at 12:19

Stop trying to parse HTML with regular expressions.

RegEx match open tags except XHTML self-contained tags

answered Nov 15 '09 at 18:31 by Ether

score 0 · Answer 6 · answered Nov 15 '09 at 15:49

(?i)<meta\s+name="keywords"\s+content="(.*?)">

In PHP, that would look something like:

preg_match('~<meta\s+name="keywords"\s+content="(.*?)">~i', $html, $matches);

answered Nov 15 '09 at 15:49 by Joost

score 0 · Answer 7 · gnud · 2009-11-15T16:05:12.643

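This is a simple regex that matches the first meta keywords tag. It only allows letters, digits, legal URL characters, HTML entities and spaces to appear inside the content attribute. Note that it is case-sensitive and expects the attributes in exactly this order, separated by a single space.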

$matches = array();
// preg_match() returns 1 on a match, so $matches[1] is only read when it exists.
if (preg_match("/<meta name=\"Keywords\" content=\"([\w\d;,\.: %&#\/\\\\]*)\"/", $html, $matches)) {
    echo $matches[1];
}

edited Nov 15 '09 at 16:05

answered Nov 15 '09 at 15:53
