So, I want to crawl a webpage?

Question

Possible Duplicates:
How to write a crawler?
Best methods to parse HTML

I've always wondered how to do something like this. I am not the owner/admin/webmaster of the site (http://poolga.com/); however, the information I wish to obtain is publicly available. This page (http://poolga.com/artists) is a directory of all of the artists that have contributed to the site. However, each link on this page goes to another page, which contains this anchor tag with the link to the artist's actual website.

<a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a>

I hate having to command + click the links in the directory and then click the link to the artist's website. I would love a way to have a batch of 10 of the artist website links appear as tabs in the browser, just for temporary viewing. However, just getting these hrefs into some sort of array would be a feat in itself. Any idea, direction, or Google search term, in any programming language, would be great! Would this even be referred to as "crawling"? Thanks for reading!

UPDATE

I used Simple HTML DOM on my local PHP MAMP server with this script. It took a little while to run!

include 'simple_html_dom.php'; // adjust the path to where the library file lives

// Collect the link to each artist's page from the directory.
$artistPages = array();
foreach (file_get_html('http://poolga.com/artists')->find('div#artists ol li a') as $element) {
    $artistPages[] = $element->href;
}

// Visit each artist page and print the link to the artist's own website.
foreach ($artistPages as $artistPage) {
    foreach (file_get_html($artistPage)->find('a#author-url') as $element) {
        echo $element->href . '<br>';
    }
}
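
The question also asks for the links in batches of 10. Once the hrefs are in an array, PHP's built-in array_chunk() is one way to split them up. This is only a minimal sketch: $links is filled with made-up example.com URLs standing in for the real results, and the batch heading format is my own choice.

```php
<?php
// Hypothetical stand-in for the array of artist website URLs collected above.
$links = array();
for ($i = 1; $i <= 23; $i++) {
    $links[] = "http://example.com/artist-$i/";
}

// Split the list into batches of at most 10 links each.
$batches = array_chunk($links, 10);

foreach ($batches as $index => $batch) {
    echo 'Batch ' . ($index + 1) . ' (' . count($batch) . " links)<br>\n";
    foreach ($batch as $url) {
        // Escape the URL before putting it into HTML output.
        $safe = htmlspecialchars($url, ENT_QUOTES);
        echo '<a target="_blank" href="' . $safe . '">' . $safe . "</a><br>\n";
    }
}
```

Each link is printed with target="_blank", so command + clicking through one batch opens its links in new tabs.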

Lots of topics cover this same thing: http://stackoverflow.com/search?q=%2Bhow+web+crawler. Also check out simple_html_dom. – JohnP, Apr 20 '11 at 17:54

Zirak · Accepted Answer · 2011-04-20T20:25:05.120

3

My favourite PHP library for navigating through the DOM is Simple HTML DOM.

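include 'simple_html_dom.php'; // adjust the path to where the library file lives

set_time_limit(0); // one HTTP request per artist, so lift the execution time limit
$poolga = file_get_html('http://poolga.com/artists');
$inRefs = $poolga->find('div#artists ol li a');
$links = array();

foreach ($inRefs as $ref) {
    $site = file_get_html($ref->href);
    if (!$site) {
        continue; // the artist page could not be loaded
    }
    $author = $site->find('a#author-url', 0);
    if ($author) {
        $links[] = $author->href;
    }
}

print_r($links);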

Code, I think, is pretty self-explanatory.

Edit: Had a spelling mistake. It would take the script a really, really long time to finish, seeing as how there are so many links; that's why I used set_time_limit(). Go do other stuff and let the script run.

edited Apr 20 '11 at 20:25

answered Apr 20 '11 at 18:02

Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Apr 20 '11 at 18:04
I am pulling my hair out trying to get this to work. I've tried a couple of variations, all with no results. I have no clue why it won't work. I'm so close, any ideas? I was successful at getting the hrefs on `http://poolga.com/artists`, but not the ones on the pages they link to. – ThomasReggi Apr 20 '11 at 20:10
See update for my solution. Wouldn't have got there without you, thanks for your help! – ThomasReggi Apr 20 '11 at 20:25
Edited; had a spelling mistake...accidentally typed $inrefs instead of $inRefs. Anyway, the execution time is long, since it's loading a ton of web pages. See edit comment. – Zirak Apr 20 '11 at 20:26
Hahaha, just noticed; I have 666 rep >:D – Zirak Apr 20 '11 at 20:26
@Zirak Not anymore ;) +1 for a much more complete example. – Aleadam Apr 20 '11 at 20:33
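
For anyone following up on the comment about PHP's built-in DOM extension: below is a minimal sketch of pulling the a#author-url href out of a page with DOMDocument and DOMXPath. The HTML string stands in for one fetched artist page, and extractAuthorUrl() is a hypothetical helper name, not part of any library.

```php
<?php
// Sample markup standing in for one fetched artist page (assumption: the real
// page contains an anchor with id "author-url", as quoted in the question).
$html = '<html><body><h2 id="author">'
      . '<a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a>'
      . '</h2></body></html>';

// Hypothetical helper: returns the href of a#author-url, or null if it is absent.
function extractAuthorUrl(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//a[@id="author-url"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

$authorUrl = extractAuthorUrl($html);
echo $authorUrl === null ? "no author link found\n" : $authorUrl . "\n";
```

In a real run, the string would come from fetching each artist page instead of being hard-coded, and the null case covers pages that lack the anchor.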

score 1 · Answer 2 · answered Apr 20 '11 at 18:05

Use some function to loop through the artist subpages (using jQuery as an example):

$("#artists li").each(function () { /* read the link inside this <li> */ });

(each entry is under a <li> inside the <div id="artists">)

Then you will have to read each page and search for the element <div id="artistSites"> or the <h2 id="author">:

$("#author a").attr("href");

The implementation details will depend on how different each page is. I only looked at two, so it may be a little more complicated than this.
