
I want to create a simple (one-page) web application using Django that shows the top 20 websites from alexa.com/topsites/global. The page should render a table with 21 rows (1 header and 20 websites) and 3 columns (rank, website and description).

My knowledge of Django is limited and I would really appreciate some help.

I've used a template to create a table with some Bootstrap, but I don't have any idea how to parse the rank, website name and description.

Could anybody lead me in the right direction with some useful websites or code snippets?

I know that I have to use HTMLParser and implement something like:

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

But I don't know how to apply it to the requirements of my application.


So, I am coming back with an update. I've tried the following (just printing the results to see whether I get what I want), but I only get some links.

Any help?

import urllib2, HTMLParser

class MyHTMLParser(HTMLParser.HTMLParser):
    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        #count div to get the rank of website
        self.in_count_div = False
        #description div to get description of website
        self.in_description_div = False
        #a tag to get the url
        self.in_link_a = False

        self.count_items = None
        self.a_link_items = None
        self.description_items = None

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if('class', 'count') in attrs:
                self.in_count_div = True

        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.a_link_items = [value,'']
                    self.in_link_a = True
                    break

        if tag == 'div':
            if('class', 'description') in attrs:
                self.in_description_div = True

    #handle data for each section
    def handle_data_count(self, data):
        if self.in_count_div:
            self.count_items[1] += data

    def handle_data_url(self, data):
        if self.in_link_a:
            self.a_link_items[1] += data

    def handle_data_description(self, data):
        if self.in_description_div:
            self.description_items[1] += data

    #endtag
    def handle_endtag(self, tag):
        if tag =='div':
            if self.count_items is not None:
                print self.count_items
            self.count_items = None
            self.in_count_div = False

        if tag =='a':
            if self.a_link_items is not None:
                print self.a_link_items
            self.a_link_items = None
            self.in_link_a = False


if __name__ == '__main__':
    myhtml = MyHTMLParser()
    myhtml.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())
Cajuu'

3 Answers


If you want an API, there is one for Alexa here.

If you want to scrape, I'd suggest BeautifulSoup
(Scrapy is too heavy for this, since the only thing you'll be doing is reading from one URL).

Doing this is simple:

  • Write a Python module that pulls the data from the Alexa page using BeautifulSoup. Have the module run the task every 5 minutes (or whatever interval suits your application) and save the results to your database.
  • To display the data, retrieve it from the database and pass it to the template in a template variable. The HTML should look something like this (don't use tables):
<table>
    {% for site in top_20_sites %}
    <tr>
        <td>{{site.rank}}</td>
        <td>{{site.name}}</td>
        <td>{{site.description}}</td>
    </tr>
    {% endfor %}
</table>
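
The "refresh every 5 minutes" idea from the first bullet can also be sketched without a database, as a small in-memory cache that feeds the `top_20_sites` template variable. This is only a sketch under assumptions: `Site`, `CachedTopSites` and `fake_fetch` are hypothetical names, the data is placeholder data rather than real Alexa results, and the code is written for Python 3.

```python
import time
from collections import namedtuple

# Hypothetical record type; the field names mirror the template variables above.
Site = namedtuple("Site", ["rank", "name", "description"])


class CachedTopSites:
    """Caches the result of a fetch function for a fixed number of seconds."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.time):
        self._fetch = fetch        # callable returning a list of Site records
        self._ttl = ttl_seconds    # 300 s matches the 5-minute interval above
        self._clock = clock        # injectable, so tests do not have to wait
        self._sites = None
        self._fetched_at = None

    def get(self):
        """Return cached sites, refetching only when the cache has expired."""
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._sites = self._fetch()
            self._fetched_at = now
        return self._sites


def fake_fetch():
    # Placeholder data only; a real version would scrape or call an API here.
    return [
        Site(1, "example.com", "An example site"),
        Site(2, "example.org", "Another example site"),
    ]


cache = CachedTopSites(fake_fetch)

# The dictionary a view could pass to the template as its context.
context = {"top_20_sites": cache.get()[:20]}
```

Whether an in-memory cache is enough depends on the deployment: with several server processes, each one keeps its own copy, which is one reason a database may still be the simpler choice.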

As for how to scrape see this awesome tutorial here

HassenPy
  • Since you are new to Django, I suggest you go through this tutorial: http://tangowithdjango.com/book17/ – HassenPy Mar 28 '15 at 16:35
  • Is it necessary to add that info to a database? Or can I retrieve it every 5 minutes from the website? Let's say I have a list for `site.rank`, one for `site.name` and one for `site.description`. I will go through each item with a `for` loop and put the info into a table. Is this possible? – Cajuu' Mar 28 '15 at 18:34
  • I have tried something. Could you lead me in the right direction ? Thanks :) I don't want to use beautifulsoup. – Cajuu' Mar 29 '15 at 16:10
  • You can make the scraping run every time a user hits your page, but that is unnecessary in most cases and also resource-consuming. You can set the interval to less than 5 minutes though; it depends on the application's requirements. I took a look at Alexa's page: you should look for an `<li>` element with `class="site-listing"`. Nested inside that `<li>` you will find the description div with `class="description"` and a div with `class="desc-paragraph"` containing an anchor tag with the URL. – HassenPy Mar 29 '15 at 19:55
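
For anyone who wants to stay with the standard-library parser instead of BeautifulSoup, the markup described in the last comment could be handled roughly as below. One point worth knowing is that `HTMLParser` only ever calls `handle_data`, so methods with other names (such as `handle_data_count` in the question) are never invoked. This is a sketch, not a verified scraper: the class names `site-listing`, `description`, `desc-paragraph` and `count` come from this thread, `SAMPLE_HTML` is made-up markup, `TopSitesParser` and `FIELD_BY_CLASS` are hypothetical names, and the code targets Python 3 (`html.parser`).

```python
from html.parser import HTMLParser  # Python 3 name; Python 2 used the HTMLParser module

# Assumed mapping from div class to output field; the class names are taken
# from the thread, not from a verified copy of the page.
FIELD_BY_CLASS = {"count": "rank", "description": "description", "desc-paragraph": "name"}


class TopSitesParser(HTMLParser):
    """Collects one dict per <li class="site-listing"> element."""

    def __init__(self):
        super().__init__()
        self.sites = []        # completed rows, one dict per site
        self._row = None       # row under construction, None outside a listing
        self._fields = []      # one entry per open <div> inside the current row
        self._in_link = False  # True while inside the site's <a> tag

    def _current_field(self):
        # Innermost open div that maps to a field, if any.
        for field in reversed(self._fields):
            if field is not None:
                return field
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "site-listing" in classes:
            self._row = {"rank": "", "name": "", "url": "", "description": ""}
            self._fields = []
        elif self._row is not None and tag == "div":
            matches = [FIELD_BY_CLASS[c] for c in classes if c in FIELD_BY_CLASS]
            self._fields.append(matches[0] if matches else None)
        elif self._row is not None and tag == "a" and self._current_field() == "name":
            self._row["url"] = attrs.get("href") or ""
            self._in_link = True

    def handle_data(self, data):
        # All text arrives here; route it to the field of the enclosing div.
        if self._row is None:
            return
        field = self._current_field()
        if field == "name" and not self._in_link:
            return  # keep only the anchor text as the site name
        if field is not None:
            self._row[field] += data

    def handle_endtag(self, tag):
        if self._row is None:
            return
        if tag == "a":
            self._in_link = False
        elif tag == "div" and self._fields:
            self._fields.pop()
        elif tag == "li":
            # Collapse runs of whitespace before storing the finished row.
            self.sites.append({k: " ".join(v.split()) for k, v in self._row.items()})
            self._row = None


# Made-up markup that follows the structure described in the comment.
SAMPLE_HTML = """
<ul>
  <li class="site-listing">
    <div class="count">1</div>
    <div class="desc-container">
      <div class="desc-paragraph"><a href="/siteinfo/example.com">Example.com</a></div>
      <div class="description">An example site.</div>
    </div>
  </li>
</ul>
"""

parser = TopSitesParser()
parser.feed(SAMPLE_HTML)
top_20_sites = parser.sites[:20]
```

Each row is a plain dict with `rank`, `name`, `url` and `description` keys, so the list can be passed to a template as `top_20_sites`. If the real page nests its elements differently, `FIELD_BY_CLASS` and the tag checks would need adjusting.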