24

For a web application I'm building I need to analyze a website, retrieve and rank its most important keywords and display those.

Getting all words, calculating their density and displaying those is relatively simple, but this gives very skewed results (e.g. stopwords ranking very high).

Basically, my question is: How can I create a keyword analysis tool in PHP which results in a list correctly ordered by word importance?

mschr
Jeroen
  • You could try looking at something like Lucene (keywordAnalyzer)... – CD001 May 23 '12 at 14:31
  • For the stopwords problem I use the wordnet database. I also use this control to visualize Density + Relationship. http://www.codeproject.com/Articles/342715/Plotting-Circular-Relationship-Graphs-with-Silverl – Leblanc Meneses Jun 02 '12 at 01:07
  • That looks very promising, thank you! – Jeroen Jun 02 '12 at 08:07

5 Answers

53

Recently, I've been working on this myself, and I'll try to explain what I did as best as possible.

Steps

  1. Filter text
  2. Split into words
  3. Remove 2 character words and stopwords
  4. Determine word frequency + density
  5. Determine word prominence
  6. Determine word containers
    1. Title
    2. Meta description
    3. URL
    4. Headings
    5. Meta keywords
  7. Calculate keyword value

1. Filter text

The first thing you need to do is make sure the encoding is correct, so convert it to UTF-8:

$file = iconv ($encoding, "utf-8", $file); // where $encoding is the current encoding

After that, you need to strip all HTML tags, punctuation, symbols and numbers. Look for functions on how to do this on Google!
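
A minimal sketch of such a filter might look like this (assuming $file holds the UTF-8 page source; the exact character class you keep is up to you):

    // Sketch: strip tags, decode entities, then drop everything that isn't a letter or whitespace.
    // For real pages you'd ideally remove <script>/<style> contents first (e.g. via DOMDocument).
    $text = strip_tags($file);                               // remove HTML tags
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');  // decode &amp; and friends
    $text = preg_replace('/[^\p{L}\s]+/u', ' ', $text);      // punctuation, symbols, numbers
    $text = mb_strtolower($text, 'UTF-8');                   // normalize case for counting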

2. Split into words

$words = mb_split( ' +', $text );

3. Remove 2 character words and stopwords

Any word consisting of either 1 or 2 characters won't be of any significance, so we remove all of them.

To remove stopwords, we first need to detect the language. There are a couple of ways to do this:

  • Checking the Content-Language HTTP header
  • Checking the lang="" or xml:lang="" attribute
  • Checking the Language and Content-Language metadata tags

If none of those are set, you can use an external API like the AlchemyAPI.
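
A rough sketch of the first two checks (assuming $html holds the raw page source and $headers an array of its response headers):

    // Sketch: try the <html lang=""> attribute first, then the Content-Language header.
    $lang = null;

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings about malformed markup
    $htmlTag = $dom->getElementsByTagName('html')->item(0);
    if ($htmlTag !== null && $htmlTag->hasAttribute('lang')) {
        $lang = strtolower(substr($htmlTag->getAttribute('lang'), 0, 2)); // e.g. "en-US" -> "en"
    }

    if ($lang === null) {
        foreach ($headers as $header) {
            if (stripos($header, 'Content-Language:') === 0) {
                $lang = strtolower(substr(trim(substr($header, strlen('Content-Language:'))), 0, 2));
            }
        }
    }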

You will need a list of stopwords per language, which can be easily found on the web. I've been using this one: http://www.ranks.nl/resources/stopwords.html
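
Putting step 3 together might then look something like this (a sketch; $stopwords is assumed to be a flat array of lowercase stopwords for the detected language):

    // Sketch: drop 1-2 character words and stopwords from the $words array of step 2.
    $stopwords = array_flip($stopwords); // flip once for fast isset() lookups
    $keywords = array();
    foreach ($words as $word) {
        if (mb_strlen($word) > 2 && !isset($stopwords[$word])) {
            $keywords[] = $word;
        }
    }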

4. Determine word frequency + density

To count the number of occurrences per word, use this:

$uniqueWords = array_unique ($keywords); // $keywords is the $words array after being filtered as mentioned in step 3
$uniqueWordCounts = array_count_values ( $words );

Now loop through the $uniqueWords array and calculate the density of each word like this:

$density = $frequency / count ($words) * 100; // where $frequency is $uniqueWordCounts[$word] for the current word

5. Determine word prominence

The word prominence is defined by the position of the words within the text. For example, the second word in the first sentence is probably more important than the 6th word in the 83rd sentence.

To calculate it, add this code within the same loop from the previous step:

$keys = array_keys ($words, $word); // $word is the word we're currently at in the loop
$positionSum = array_sum ($keys) + count ($keys); // sum of the 1-based positions of all occurrences
$prominence = (count ($words) - (($positionSum - 1) / count ($keys))) * (100 / count ($words)); // 100 for the very first word, approaching 0 for the last
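
Putting steps 4 and 5 together, the loop could look like this (a sketch using the variables defined above; the result array name is just an example):

    // Sketch: frequency, density and prominence per unique word (steps 4 + 5 combined).
    $wordCount = count($words);
    $result = array();
    foreach ($uniqueWords as $word) {
        $frequency = $uniqueWordCounts[$word];
        $density = $frequency / $wordCount * 100;

        $keys = array_keys($words, $word);
        $positionSum = array_sum($keys) + count($keys);
        $prominence = ($wordCount - (($positionSum - 1) / count($keys))) * (100 / $wordCount);

        $result[$word] = array(
            'frequency'  => $frequency,
            'density'    => $density,
            'prominence' => $prominence,
        );
    }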

6. Determine word containers

A very important part is to determine where a word resides - in the title, description and more.

First, you need to grab the title, all metadata tags and all headings using something like DOMDocument or PHPQuery (don't try to use regex!). Then you need to check, within the same loop, whether these contain the words.
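
A sketch of such a container check with DOMDocument ($doc is assumed to hold the loaded page; the URL check from the step list is left out, but it works the same way on the page's URL string):

    // Sketch: collect the lowercased text of each "container" once, outside the word loop.
    $xpath = new DOMXPath($doc);
    $containerTexts = array(
        'title'            => '',
        'meta_description' => '',
        'meta_keywords'    => '',
        'headings'         => '',
    );

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $containerTexts['title'] = mb_strtolower($titleNode->textContent, 'UTF-8');
    }
    foreach ($xpath->query('//meta[@name="description"]/@content') as $attr) {
        $containerTexts['meta_description'] = mb_strtolower($attr->nodeValue, 'UTF-8');
    }
    foreach ($xpath->query('//meta[@name="keywords"]/@content') as $attr) {
        $containerTexts['meta_keywords'] = mb_strtolower($attr->nodeValue, 'UTF-8');
    }
    foreach ($xpath->query('//h1 | //h2 | //h3') as $heading) {
        $containerTexts['headings'] .= ' ' . mb_strtolower($heading->textContent, 'UTF-8');
    }

    // Inside the word loop: which containers mention the current $word?
    $containers = array();
    foreach ($containerTexts as $name => $text) {
        if (mb_strpos($text, $word) !== false) {
            $containers[] = $name;
        }
    }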

7. Calculate keyword value

The last step is to calculate a keyword's value. To do this, you need to weigh each factor - density, prominence and containers. For example:

$value = (double) ((1 + $density) * ($prominence / 10)) * (1 + (0.5 * count ($containers)));

This calculation is far from perfect, but it should give you decent results.
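
Finally, to get the list ordered by importance that the question asks for, sort by that value (assuming you collected each word's $value in an array keyed by the word, e.g. $keywordValues):

    // Sketch: highest keyword value first, keep the top 20.
    arsort($keywordValues);
    $topKeywords = array_slice($keywordValues, 0, 20, true); // true = preserve the word keys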

Conclusion

I haven't mentioned every single detail of what I used in my tool, but I hope it offers a good view into keyword analysis.

N.B. Yes, this was inspired by today's blog post about answering your own questions!

Jeroen
  • Note: If someone has any ideas on how to improve this, you're more than welcome to edit my answer or add another answer, I'd love to hear them! – Jeroen May 23 '12 at 14:27
  • @Jeroen you can improve the speed of the whole process by moving the most intensive part into a php extension. – Vlad Balmos May 30 '12 at 13:57
  • @Vlad: That would mean writing the whole code in C, right? It's definitely a way to speed it up significantly, but unfortunately, I currently lack the expertise to do so. – Jeroen May 30 '12 at 16:10
  • @Jeroen yes, that would require C because it provides a significant speed boost. – Vlad Balmos May 30 '12 at 17:02
  • @Jeroen this is the best available solution. mark this as answer. – Alfred Jun 01 '12 at 14:56
  • I would also add a stemming algorithm (http://snowball.tartarus.org/texts/introduction.html) – Fabien Jun 01 '12 at 17:24
  • @Jeroen sorry to bring this up almost 2 years after this got posted. But as this answer is by far the best one I could find on the internet, I'd like to ask for your help on steps 6-7 if you're still willing to help or remember anything from this answer (as it's been 2 years) – Déjà vu Apr 02 '14 at 10:30
  • @Jeroen this is my question: http://stackoverflow.com/questions/22808192/php-domdocument-finding-words if you're interested. – Déjà vu Apr 02 '14 at 14:18
  • It's worth noting that Sphinxsearch ties into MySQL, Supports stripping HTML, and has an advanced version of this (across all documents), is much faster (with less memory usage), and can be connected to as if it was a MySQL instance. – Xeoncross Feb 21 '17 at 22:20
4

One thing which is missing in your algorithm is document-oriented analysis (if you didn't omit it intentionally for some reason).

Every site is built on a document set. Counting word frequencies across each and every document will provide you with information about word coverage. Words which occur in most documents are stop words. Words specific to a limited number of documents can form a cluster of documents on a specific topic. The number of documents pertaining to a specific topic can increase the overall importance of that topic's words, or at least provide an additional factor to be counted in your formulae.
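
As a sketch of that idea, a simple document-frequency count (whose inverse is the classic IDF weight) could look like this; $documents is assumed to be an array holding one filtered word list per document:

    // Sketch: document frequency and a simple IDF factor across the whole document set.
    $docCount = count($documents);
    $docFrequency = array();
    foreach ($documents as $docWords) {
        foreach (array_unique($docWords) as $word) {
            $docFrequency[$word] = isset($docFrequency[$word]) ? $docFrequency[$word] + 1 : 1;
        }
    }

    // Words appearing in (nearly) every document behave like stop words: a low IDF
    // pushes them down, while words specific to a few documents get a boost.
    $idf = array();
    foreach ($docFrequency as $word => $df) {
        $idf[$word] = log($docCount / $df);
    }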

Perhaps you could benefit from a preconfigured classifier which contains categories/topics and keywords for each of them (this task can be partially automated by indexing existing public hierarchies of categories, up to Wikipedia, but this is not a trivial task itself). Then you can involve categories in the analysis.

Also, you can improve statistics by analysis at the sentence level. That is, having frequencies of how often words occur in the same sentence or phrase, you can discover cliches and duplicates and eliminate them from the statistics. But I'm afraid this is not easily implemented in pure PHP.

Stan
  • Though this is too advanced the way I'm applying keyword analysis, these are excellent suggestions on how to improve it, thanks! – Jeroen May 31 '12 at 20:21
  • @Jeroen, BTW, filtering out HTML tags on the very first step can drop important information about document structure. I suggest analysing the document as an HTML document first, detecting its main content block, and only then applying your algorithm on the main content. This will allow you to eliminate menus, forms, footers and headers, all the auxiliary stuff, from consideration. – Stan May 31 '12 at 20:50
  • I tried that, using the Readability project (http://www.keyvan.net/2010/08/php-readability/), but sometimes it will get the incorrect block of text. Also, I analyze mostly frontpages of websites, so they often don't really have a main text block. – Jeroen May 31 '12 at 20:52
  • However, if someone would use it to analyze something like articles/blog posts, it's definitely a good idea! – Jeroen May 31 '12 at 20:53
4

This is probably a small contribution, but I'll mention it nonetheless.

Context scoring

To a certain extent you're already looking at the context of a word by using the position in which it's placed. You could add another factor to this by ranking words that appear in a heading (H1, H2, etc.) higher than words inside a paragraph, higher than perhaps words in a bulleted list, etc.
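
A sketch of such context weights (the tag list and numbers are arbitrary examples; $tag and $baseScore are assumed to come from your own loop):

    // Sketch: multiply a word's base score by a weight depending on the tag it appears in.
    $contextWeights = array('h1' => 3.0, 'h2' => 2.0, 'p' => 1.0, 'li' => 0.8);
    $tagWeight = isset($contextWeights[$tag]) ? $contextWeights[$tag] : 1.0;
    $score = $baseScore * $tagWeight;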

Frequency sanitization

Detecting stop words based on the language might work, but perhaps you could consider using a bell curve to determine which word frequencies / densities are too extreme (e.g. strip anything below the 5th or above the 95th percentile), and then apply the scoring to the remaining words. Not only does it filter out stop words, it also prevents keyword abuse, at least in theory :)
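
A sketch of that kind of frequency sanitization, reusing the word => frequency array from the accepted answer (the 5% / 95% cut-offs are just the example values above):

    // Sketch: keep only words whose frequency lies between the 5th and 95th percentile.
    $frequencies = array_values($uniqueWordCounts);
    sort($frequencies);

    $n = count($frequencies);
    $lower = $frequencies[(int) floor($n * 0.05)];
    $upper = $frequencies[(int) ceil($n * 0.95) - 1];

    $sanitized = array_filter($uniqueWordCounts, function ($freq) use ($lower, $upper) {
        return $freq >= $lower && $freq <= $upper;
    });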

Ja͢ck
4

@ refining 'Steps'

In regards to doing these many steps, I would go with a slightly 'enhanced' solution, stitching some of your steps together.

I'm not sure whether a full lexer is better, though, if you design it perfectly to fit your needs, e.g. looking only for text within hX etc. But you would have to mean serious business, since it can be a headache to implement. Still, I will make my point and say that a Flex / Bison solution in another language (PHP offers poor support, as it is such a high-level language) would be an 'insane' speed boost.

Luckily, however, libxml provides magnificent features and, as the following should show, you will end up having multiple steps in one. Before the point where you analyse the contents, set up the language (stopwords), minify the NodeList set and work from there.

  1. load full page in
  2. detect language
  3. extract only <body> into a separate field
  4. release a bit of memory from <head> and the like, e.g. unset($fullpage);
  5. fire your algorithm (if pcntl - i.e. a Linux host - is available, forking and releasing the browser is a nice feature)

When using DOM parsers, one should realize that settings may introduce further validation for attributes such as href and src, depending on the library (parse_url and the like).

Another way of getting around the timeout / memory consumption issues is to call php-cli (also works on a Windows host), 'get on with business' and start the next document. See this question for more info.

If you scroll down a bit, look at the proposed schema - the initial crawl would put only the body in the database (and additionally lang in your case); a cron script then fills in the ft_index columns using the following function:

    function analyse() {
        $doc = new DOMDocument(); // assuming a fresh DOMDocument here; in the original class this may be a member
        ob_start(); // don't care about warnings, clean ob contents after parse
        $doc->loadHTML("<html><head><meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"/></head><body><pre>" . $this->html_entity_decode("UTF-8") . "</pre></body>");
        ob_end_clean();
        $weighted_ft = array('0' => "", '5' => "", '15' => "");

        // relevance weight 0 (highest)
        $includes = $doc->getElementsByTagName('h1');
        foreach ($includes as $h) {
            $text = $h->textContent;
            // check/filter stopwords and uniqueness
            // do so with the other weights as well, basically narrow it down before counting
            $weighted_ft['0'] .= " " . $text;
        }

        // relevance weight 5
        $includes = $doc->getElementsByTagName('h2');
        foreach ($includes as $h) {
            $weighted_ft['5'] .= " " . $h->textContent;
        }

        // relevance weight 15
        $includes = $doc->getElementsByTagName('p');
        foreach ($includes as $p) {
            $weighted_ft['15'] .= " " . $p->textContent;
        }

        // pseudo: start counting frequencies and stuff
        // foreach weighted_ft sz do
        //   foreach word in sz do
        //     frequency / prominence
    }

    function html_entity_decode($toEncoding) {
        $encoding = mb_detect_encoding($this->body, "ASCII,JIS,UTF-8,ISO-8859-1,ISO-8859-15,EUC-JP,SJIS");
        $body = mb_convert_encoding($this->body, $toEncoding, ($encoding != "" ? $encoding : "auto"));
        return html_entity_decode($body, ENT_QUOTES, $toEncoding);
    }

The above is part of a class resembling your database row, which has the page's 'body' field loaded beforehand.

Again, as far as database handling goes, I ended up inserting the above parsed result into a full-text flagged table column so that future lookups would go seamlessly. This is a huge advantage for DB engines.

Note on full-text indexing:

When dealing with a small number of documents it is possible for the full-text search engine to directly scan the contents of the documents with each query, a strategy called serial scanning. This is what some rudimentary tools, such as grep, do when searching.

Your indexing algorithm filters out some words, OK. But these are enumerated by how much weight they carry - there is a strategy to think out here, since a full-text string does not carry over the given weights. That is why, in the example, a basic strategy of splitting the content into 3 different strings is given.

Once put into the database, the columns should then resemble this; a schema could look like the one below, where we maintain the weights and still offer a superfast query method:

CREATE TABLE IF NOT EXISTS `oo_pages` (
  `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `body` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'PageBody entity encoded html',
  `title` varchar(31) COLLATE utf8_danish_ci NOT NULL,
  `ft_index5` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted highest',
  `ft_index10` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted medium',
  `ft_index15` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted lesser',
  `ft_lastmodified` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'last cron run',
  PRIMARY KEY (`id`),
  FULLTEXT KEY `ft_index5` (`ft_index5`),
  FULLTEXT KEY `ft_index10` (`ft_index10`),
  FULLTEXT KEY `ft_index15` (`ft_index15`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;

One may add an index like so:

ALTER TABLE `oo_pages` ADD FULLTEXT (
`named_column`
)
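
To illustrate the superfast query method mentioned above, a weighted lookup against the three full-text columns might look like this (a sketch in PHP/PDO; the connection details are placeholders and the 3/2/1 multipliers are arbitrary assumptions mirroring the column weights):

    // Sketch: rank pages for $keyword by combining the three weighted MATCH ... AGAINST scores.
    $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');

    $sql = "SELECT id, title,
                   (3 * MATCH (ft_index5)  AGAINST (:kw1) +
                    2 * MATCH (ft_index10) AGAINST (:kw2) +
                    1 * MATCH (ft_index15) AGAINST (:kw3)) AS score
            FROM oo_pages
            HAVING score > 0
            ORDER BY score DESC
            LIMIT 10";

    $stmt = $pdo->prepare($sql);
    $stmt->execute(array(':kw1' => $keyword, ':kw2' => $keyword, ':kw3' => $keyword));
    $pages = $stmt->fetchAll(PDO::FETCH_ASSOC);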

Detecting the language and then selecting your stopword database from that point is a feature I myself have left out, but it's nifty - and by the book! So kudos for your efforts and this answer :)

Also, keep in mind there's not only the title tag, but also anchor / img title attributes. If for some reason your analytics goes into a spider-like state, I would suggest combining the referencing link's (<a>) title and textContent with the target page's <title>.

mschr
  • Thank you for these great suggestions! I'll start working on rewriting my code soon based on the code you provided (will probably put it on GitHub) One thing though, what exactly do you mean by `a full-text flagged table (weightless) field`? – Jeroen Jun 02 '12 at 15:29
  • It varies from database to database what the word 'fulltext' means. Personally I only work with MySQL DBs. Here you would have to create a table (or alter one) to use MyISAM and then set an index for your column. Only CHAR, VARCHAR, or TEXT columns are usable, quite a given though, wouldn't you say? :) Check: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html whilst keeping the caveats in mind: http://dev.mysql.com/doc/refman/5.0/en/fulltext-restrictions.html – mschr Jun 02 '12 at 18:19
  • `full-text` is actually one of the words I did understand, I worked with MySQL before, but what do you mean by `flagged` and `weightless`? – Jeroen Jun 02 '12 at 18:33
  • I'll refine the answer - full-text is very useful for searching site contents (i.e. like Google :) since it uses hashes indexed by tokens instead of the 'serial approach', which is similar to `grep`. On a large set of documents, serial searches get lengthy. Strike the remark on 'weightless' - it's leftover from my own algo where I don't put weights on keywords and only fly with one column – mschr Jun 02 '12 at 18:51
  • All right, that explains it, thanks! (though I won't need it, I'm just saving the keyword list) – Jeroen Jun 02 '12 at 18:59
2

I'd recommend that instead of re-inventing the wheel, you use Apache Solr for search and analysis. It has almost everything you might need, including stop-word detection for 30+ languages [as far as I can remember, might be even more], and it can do tons of stuff with the data stored in it.

Gelmir
  • I'm not building a search function, I'm using it to display keyword analysis of a website; as far as I know, Solr/Lucene is not able to do so. – Jeroen May 31 '12 at 13:20