-1

I'm trying to write a script that crawls the current top 10 PR/Alexa sites. Because the PR/Alexa rankings change frequently, the script has to account for that: a site that is not in the top 10 today could be there tomorrow.

I don't know where to start. I know the basic crawling concepts, but I'm stuck here. It could just as well be the top 50 or even the top 500 sites, which I would of course make configurable.

I read about the Google spider, but it is far too complicated for this simple task. How do Google, Yahoo, and Bing crawl billions of sites around the web? I'm just curious. What is the starting point? I mean, how can Google identify a newly launched site?

OK, these are very deep details, and I can read about them later. Right now I'm more concerned with my own problem: how can I crawl the top 10 PR sites?

Can you provide a sample program so that I can understand better?

user3745956
  • 1
    There is no such thing as a simple program for Information Retrieval problems. Google and Github are your friends. – eliasah Jun 27 '14 at 18:55
  • Use this file (updated every day) with the Alexa top 1,000,000 sites: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip – Andrew Rukin Dec 11 '15 at 13:54
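
A minimal sketch of how the file from the comment above could be used to get a configurable top N. It assumes the archive contains a single member named top-1m.csv with one "rank,domain" row per line, already sorted by rank; that layout is an assumption and should be checked against the actual download. The function name top_domains and its parameters are hypothetical.

```python
import csv
import io
import zipfile


def top_domains(zip_bytes, limit=10, member="top-1m.csv"):
    """Return the first `limit` domains from a zipped "rank,domain" CSV.

    Assumes the rows are already ordered by rank, so the first rows
    are the top sites.
    """
    domains = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                if len(domains) >= limit:
                    break
                # Skip blank or malformed rows that lack a domain column.
                if len(row) >= 2:
                    domains.append(row[1])
    return domains


# Possible usage (performs a network request, so it is left commented out):
# from urllib.request import urlopen
# data = urlopen("http://s3.amazonaws.com/alexa-static/top-1m.csv.zip").read()
# print(top_domains(data, limit=50))
```

Because the list is read fresh on every run, running the script once a day would pick up sites that newly enter the top N without any extra bookkeeping.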

2 Answers

1

It's rather simple to fetch the top 25 sites (if I understood correctly what you wanted to do).

Code:

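from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the Alexa top sites page and parse it with the built-in HTML parser.
b = BeautifulSoup(urlopen("http://www.alexa.com/topsites").read(), "html.parser")
# Each listed site sits in a <p class="desc-paragraph">; its link text is the site name.
paragraphs = b.find_all('p', {'class': 'desc-paragraph'})
for p in paragraphs:
    print(p.a.text)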

Output:

Google.com
Facebook.com
Youtube.com
Yahoo.com
Baidu.com
Wikipedia.org
(...)

Keep in mind, though, that the law in some countries may be stricter. Do this at your own risk.
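
To make the number of sites configurable, the same idea can be wrapped in a function. The following is only a sketch: it assumes the page still marks each site with a `<p class="desc-paragraph">` containing a link, as in the code above, and it uses the standard library's html.parser instead of BeautifulSoup purely so that it runs without extra packages. The names TopSitesParser and top_sites are hypothetical.

```python
from html.parser import HTMLParser


class TopSitesParser(HTMLParser):
    """Collect link text found inside <p class="desc-paragraph"> elements."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.sites = []
        self._in_paragraph = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "desc-paragraph" in classes:
            self._in_paragraph = True
        elif tag == "a" and self._in_paragraph:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        name = data.strip()
        if self._in_link and name and len(self.sites) < self.limit:
            self.sites.append(name)
            # Record only the first text chunk of each link.
            self._in_link = False


def top_sites(html, limit=10):
    """Return up to `limit` site names from the given page HTML."""
    parser = TopSitesParser(limit)
    parser.feed(html)
    return parser.sites
```

The HTML would be fetched as before, for example with urlopen(...).read().decode(), and passed in along with the desired limit, such as top_sites(html, limit=50).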

Dawid Gosławski
0

Alexa has a paid API you can use.

**There is also a free API**

I haven't been able to find any documentation for it anywhere, though.

http://data.alexa.com/data?cli=10&url=%YOUR_URL%

You can also query for more data the following way:

http://data.alexa.com/data?cli=10&dat=snbamz&url=%YOUR_URL%

The letters in dat determine which information you get. This dat string is the one I've found that seems to return the most options. Also, cli changes the output completely; with this value the API returns an XML document with quite a lot of information.
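
A rough sketch of how the query URL could be built and the XML response read in Python. Since the API is undocumented, the response layout is an assumption here: the code supposes the rank appears as the TEXT attribute of a POPULARITY element, which should be verified against a real response. The function names alexa_data_url and parse_rank are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def alexa_data_url(site, dat="snbamz", cli=10):
    """Build the query URL with the cli, dat, and url parameters."""
    query = urlencode({"cli": cli, "dat": dat, "url": site})
    return "http://data.alexa.com/data?" + query


def parse_rank(xml_text):
    """Return the rank as an int, or None if it cannot be found.

    Assumption: the rank is stored as <POPULARITY TEXT="..."/> somewhere
    below the root element.
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//POPULARITY")
    if node is None:
        return None
    text = node.get("TEXT")
    return int(text) if text and text.isdigit() else None


# Possible usage (performs a network request, so it is left commented out):
# from urllib.request import urlopen
# xml_text = urlopen(alexa_data_url("example.com")).read().decode("utf-8")
# print(parse_rank(xml_text))
```

Returning None instead of raising keeps the caller simple when a site has no rank or the response layout differs from what is assumed.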

EDIT: This API is the one used by the Alexa toolbar.

Fetching Alexa data

Eddie Martinez