I'm trying to write a script that crawls the current top 10 PR/Alexa sites. These rankings change frequently, so the script has to account for that: a site that is not in the top 10 today could be there tomorrow.
I don't know where to start. I know the basic crawling concepts, but I'm stuck here. It could be the top 50 or even the top 500 sites, which should of course be configurable.
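To make the requirement concrete, here is a minimal sketch of the shape I have in mind, not a working solution. It assumes the ranking is available as CSV text with one "rank,domain" row per site; that format, the example domains, and the names TOP_N, top_sites, fetch and crawl_top are all made up for illustration.

```python
import csv
import io
from urllib.request import Request, urlopen

TOP_N = 10  # assumed setting: change to 50, 500, ...


def top_sites(ranking_csv, n=TOP_N):
    """Return the n best-ranked domains from 'rank,domain' CSV text."""
    rows = csv.reader(io.StringIO(ranking_csv))
    # Skip blank or malformed rows, then sort by numeric rank.
    ranked = sorted(
        (int(row[0]), row[1].strip()) for row in rows if len(row) >= 2
    )
    return [domain for _, domain in ranked[:n]]


def fetch(domain, timeout=10):
    """Download a domain's home page; return the HTML, or None on a network error."""
    req = Request(
        "http://" + domain,
        headers={"User-Agent": "top-n-crawler-sketch"},  # hypothetical agent name
    )
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return None


def crawl_top(ranking_csv, n=TOP_N, fetcher=fetch):
    """Fetch the home page of each of the top n sites, keyed by domain."""
    return {domain: fetcher(domain) for domain in top_sites(ranking_csv, n)}
```

The idea is that the ranking text is re-read on every run, so a site that enters the top N tomorrow is picked up without changing the script. Where that ranking text should come from is the part I'm missing.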
I read about Google's spider, but it is far too complicated for this simple task. How do Google, Yahoo, and Bing crawl billions of sites across the web? I'm just curious. What is the "cursor point", I mean, how can Google identify a newly launched site?
OK, those are very deep details that I will read about later. Right now I'm more concerned with my own problem: how can I crawl the top 10 PR sites?
Can you provide a sample program so that I can understand better?