
I am working on a crawling script in PHP. I am using PHP Simple HTML DOM Parser.

After getting the HTML I need to extract only some of the info from each page and aggregate these into my own HTML page on my site.

I don't understand how to proceed with this.

Any help is appreciated.

Added

I want to extract some posts (if related to a particular geography and topic)

AJ.
  • Jesus. where do you start. You will need some strategy for what you want to do. For example, you could use a file of keywords with some of the stuff you want to extract, you could implement a list indicating what stuff you want to pull out....Lots of ways to skin this cat.... – brumScouse Dec 08 '10 at 08:35
  • What exactly do you want to extract....is it email addresses? – taher chhabrawala Dec 08 '10 at 08:42
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Dec 08 '10 at 08:43

2 Answers


Regular expressions may be the way to get complex info out of the data, but for simple tags you can use something like:


// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
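Since you mentioned you only want posts related to a particular geography and topic, one approach (just a sketch; the helper function name and the sample data are made up for illustration) is to run each extracted text block through a case-insensitive keyword filter before aggregating it:

```php
<?php
// Hypothetical helper: keep only text blocks that mention at least one
// keyword for the geography/topic you care about (case-insensitive).
function matchesKeywords($text, array $keywords) {
    foreach ($keywords as $kw) {
        if (stripos($text, $kw) !== false) {
            return true;
        }
    }
    return false;
}

// Example filter terms and example extracted post texts.
$keywords = ['london', 'transport'];
$posts = [
    'New bus routes announced in London',
    'Recipe: apple pie',
    'Transport strike expected next week',
];

// Keep only the posts that match, reindexed from 0.
$matches = array_values(array_filter($posts, function ($p) use ($keywords) {
    return matchesKeywords($p, $keywords);
}));

print_r($matches); // the London and transport posts survive the filter
```

In the real script, `$posts` would come from your Simple HTML DOM `find()` calls (e.g. each element's `plaintext`), and the keyword list could live in a file, as brumScouse suggested.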

Skorpioh

You could do something like this:

$doc = new DomDocument();
@$doc->loadHTMLFile($url);
$xpath = new DOMXpath($doc);
$nodeList = $xpath->query("your-xpath-query");
foreach ($nodeList as $node) {
    // grab the content, attributes or whatever you're looking for
}

Using XPath queries, you don't have to traverse the DOM tree manually, and your script is more robust against structural changes in the sites you crawl.
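To make that concrete, here is a self-contained sketch that parses an inline HTML snippet instead of a fetched URL (the `post` class name and the markup are invented for illustration, not taken from any real site):

```php
<?php
// Minimal, self-contained example: parse an HTML snippet with DOMDocument
// and pull out post titles via an XPath query.
$html = <<<HTML
<html><body>
  <div class="post"><h2>Rome travel tips</h2><p>...</p></div>
  <div class="post"><h2>Paris food guide</h2><p>...</p></div>
  <div class="ad">Buy now!</div>
</body></html>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html);       // @ suppresses warnings about sloppy HTML
$xpath = new DOMXpath($doc);

// Select only the <h2> headings inside elements with class "post";
// the <div class="ad"> is skipped automatically.
$titles = [];
foreach ($xpath->query('//div[@class="post"]/h2') as $node) {
    $titles[] = trim($node->textContent);
}

print_r($titles); // Rome travel tips, Paris food guide
```

For a crawled page you would swap the heredoc for `loadHTMLFile($url)` as in the snippet above and adjust the XPath expression to the site's actual structure.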

I hope that gets you on the right track. For a more detailed example, you'll have to provide more information.

rik