
I am working on a crawling script in PHP. I am using PHP Simple HTML DOM Parser.

After getting the HTML I need to extract only some of the info from each page and aggregate these into my own HTML page on my site.

I don't understand how to proceed with this.

Any help is appreciated.

Added

I want to extract some posts (if related to a particular geography and topic)

AJ.
  • Jesus. where do you start. You will need some strategy for what you want to do. For example, you could use a file of keywords with some of the stuff you want to extract, you could implement a list indicating what stuff you want to pull out....Lots of ways to skin this cat.... – brumScouse Dec 08 '10 at 08:35
  • What exactly do you want to extract....is it email addresses? – taher chhabrawala Dec 08 '10 at 08:42
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Dec 08 '10 at 08:43

2 Answers


Regular expressions may be the way to get complex info out of the data, but for simple tags you can use something like:


// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
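Since you mentioned you only want posts related to a particular geography and topic, one approach (just a sketch; the helper function name and the sample data are made up for illustration) is to run each extracted text block through a case-insensitive keyword filter before aggregating it:

```php
<?php
// Hypothetical helper: keep only text blocks that mention at least one
// keyword for the geography/topic you care about (case-insensitive).
function matchesKeywords($text, array $keywords) {
    foreach ($keywords as $kw) {
        if (stripos($text, $kw) !== false) {
            return true;
        }
    }
    return false;
}

// Example filter terms and example extracted post texts.
$keywords = ['london', 'transport'];
$posts = [
    'New bus routes announced in London',
    'Recipe: apple pie',
    'Transport strike expected next week',
];

// Keep only the posts that match, reindexed from 0.
$matches = array_values(array_filter($posts, function ($p) use ($keywords) {
    return matchesKeywords($p, $keywords);
}));

print_r($matches); // the London and transport posts survive the filter
```

In the real script, `$posts` would come from your Simple HTML DOM `find()` calls (e.g. each element's `plaintext`), and the keyword list could live in a file, as brumScouse suggested.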

Skorpioh

You could do something like this:

$doc = new DomDocument();
@$doc->loadHTMLFile($url);
$xpath = new DOMXpath($doc);
$nodeList = $xpath->query("your-xpath-query");
foreach ($nodeList as $node) {
    // grab the content, attributes or whatever you're looking for
}

Using XPath queries, you don't have to traverse the DOM tree manually, and your script is more robust against structural changes in the sites you crawl.
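To make that concrete, here is a self-contained sketch that parses an inline HTML snippet instead of a fetched URL (the `post` class name and the markup are invented for illustration, not taken from any real site):

```php
<?php
// Minimal, self-contained example: parse an HTML snippet with DOMDocument
// and pull out post titles via an XPath query.
$html = <<<HTML
<html><body>
  <div class="post"><h2>Rome travel tips</h2><p>...</p></div>
  <div class="post"><h2>Paris food guide</h2><p>...</p></div>
  <div class="ad">Buy now!</div>
</body></html>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html);       // @ suppresses warnings about sloppy HTML
$xpath = new DOMXpath($doc);

// Select only the <h2> headings inside elements with class "post";
// the <div class="ad"> is skipped automatically.
$titles = [];
foreach ($xpath->query('//div[@class="post"]/h2') as $node) {
    $titles[] = trim($node->textContent);
}

print_r($titles); // Rome travel tips, Paris food guide
```

For a crawled page you would swap the heredoc for `loadHTMLFile($url)` as in the snippet above and adjust the XPath expression to the site's actual structure.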

I hope that gets you on the right track. For a more detailed example, you'll have to provide more information.

rik