
How can I grab all the text on a website? I don't just mean Ctrl+A/Ctrl+C. I'd like to extract all the text from a website (and all of its associated pages) and use it to build a concordance of words from that site. Any ideas?

slugster

1 Answer


I was intrigued by this, so I've written the first part of a solution.

The code is written in PHP because of the convenient strip_tags function. It's also rough and procedural, but I feel it demonstrates my ideas.

<?php
$url = "http://www.stackoverflow.com";

// To use this you'll need to get a key for the Readability Parser API http://readability.com/developers/api/parser
$token = "";

// I make an HTTP GET request to the Readability API and then decode the returned JSON.
// The target URL is URL-encoded so it survives being passed as a query parameter.
$response = file_get_contents("http://www.readability.com/api/content/v1/parser?url=" . urlencode($url) . "&token=" . urlencode($token));
if ($response === false)
{
    die("Request to the parser API failed\n");
}
$parserResponse = json_decode($response);

// I'm only interested in the content string in the JSON object
$content = $parserResponse->content;

// I strip the HTML tags from the article content
$wordsOnPage = strip_tags($content);

$wordCounter = array();

// Split on any run of whitespace (spaces, tabs, newlines) and drop empty strings
$wordSplit = preg_split('/\s+/', $wordsOnPage, -1, PREG_SPLIT_NO_EMPTY);

// I then loop through each word in the article, keeping count of how many times I've seen the word
foreach ($wordSplit as $word)
{
    incrementWordCounter($word);
}

// Then I sort the array so the most frequent words are at the end
asort($wordCounter);

// And dump the array
var_dump($wordCounter);

function incrementWordCounter($word)
{
    global $wordCounter;

    if (isset($wordCounter[$word]))
    {
        $wordCounter[$word] = $wordCounter[$word] + 1;
    }
    else
    {
        $wordCounter[$word] = 1;
    }
}
?>

I needed to do this to configure PHP for the SSL the readability API uses.

The next step in the solution would be to search for links in the page and call this recursively in an intelligent way to handle the associated pages requirement.
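One way the link-following step might look is sketched below. It is only an illustration, not part of the code above: extractSameHostLinks and crawl are hypothetical names, the crawl is a breadth-first loop with a visited set instead of literal recursion, and $fetch is assumed to be any callable that returns a page's HTML (or false on failure). It assumes PHP's DOM extension is available, only follows absolute http(s) links whose host exactly matches the start URL, and ignores relative links and robots.txt.

```php
<?php
// Hypothetical helper: collect absolute links that share the host of $baseUrl.
function extractSameHostLinks(string $html, string $baseUrl): array
{
    if ($html === '') {
        return array();
    }
    $host = parse_url($baseUrl, PHP_URL_HOST);
    $doc = new DOMDocument();
    // Real-world HTML is often malformed, so parser warnings are suppressed.
    @$doc->loadHTML($html);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // This sketch skips relative links; only absolute http(s) URLs are kept.
        if (!preg_match('#^https?://#i', $href)) {
            continue;
        }
        if (parse_url($href, PHP_URL_HOST) === $host) {
            // Drop any #fragment so the same page is not queued twice.
            $links[explode('#', $href, 2)[0]] = true;
        }
    }
    return array_keys($links);
}

// Hypothetical breadth-first crawl, capped at $maxPages fetched pages.
function crawl(string $startUrl, callable $fetch, int $maxPages = 20): array
{
    $queue = array($startUrl);
    $seen = array($startUrl => true);
    $pages = array();

    while ($queue && count($pages) < $maxPages) {
        $url = array_shift($queue);
        $html = $fetch($url);
        if ($html === false || $html === '') {
            continue; // fetch failed or the page was empty
        }
        $pages[$url] = $html;
        foreach (extractSameHostLinks($html, $startUrl) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $pages; // map of URL => raw HTML
}
```

Passing 'file_get_contents' as $fetch would make real HTTP requests (assuming allow_url_fopen is enabled); each returned page could then go through strip_tags and the word counter. The visited set is what keeps pages that link to each other from looping forever.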

Also, the code above just gives the raw data of a word count. You would want to process it some more to make it meaningful.
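As a rough idea of what that extra processing could involve, the sketch below lowercases the text, keeps only runs of letters (with internal apostrophes), drops a caller-supplied stop-word list, and sorts the most frequent words first. The function name buildWordCounts and the stop-word list are assumptions made for illustration; strtolower only folds ASCII letters, so non-English text would need something like mb_strtolower instead.

```php
<?php
// Hypothetical helper: turn raw text into a word => count map, most frequent first.
function buildWordCounts(string $text, array $stopWords = array()): array
{
    // Lowercase so "The" and "the" are counted as the same word (ASCII only).
    $text = strtolower($text);

    // Keep runs of letters, allowing internal apostrophes as in "don't".
    // Punctuation and digits are discarded.
    if (!preg_match_all("/\\p{L}+(?:'\\p{L}+)*/u", $text, $matches)) {
        return array(); // no words found, or the input was not valid UTF-8
    }

    $counts = array_count_values($matches[0]);

    // Remove very common words that say little about the site.
    $counts = array_diff_key($counts, array_flip($stopWords));

    // Sort by count, highest first, keeping the word keys.
    arsort($counts);
    return $counts;
}

$counts = buildWordCounts(
    "The cat saw the dog. The dog ran; the cat didn't.",
    array('the')
);
print_r($counts);
```

A true concordance would also record where each word occurs (for example the page URL and position), which this sketch does not attempt.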

Joel