I have the following problem:

I would like to parse HTML files and extract the links they contain. I can get the links with the following code:

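from HTMLParser import HTMLParser  # Python 2; on Python 3: from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        self.url = url
        # Per-instance list; a class-level list would be shared by every parser.
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for an attribute without a value, e.g. <a href>
            if name == 'href' and value and value.startswith('http:'):
                self.links.append(value)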

But I don't want to get audio files, video files, etc. I only want links to HTML pages. How can I do that?

ppaulojr
Ali Ismayilov

1 Answer

"I can check the link's ending and, if it has a particular format, skip appending that link to the list. Is there another way?"
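
For reference, the extension check described in that quote might look like the sketch below. It is written for Python 3; the name `looks_like_media` and the `MEDIA_EXTENSIONS` set are hypothetical, and the list of extensions is only an assumption about which formats should be skipped. Because it inspects nothing but the URL path, it cannot classify links whose type is not reflected in the path.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical blocklist; extend it to suit the sites being crawled.
MEDIA_EXTENSIONS = {
    '.mp3', '.wav', '.ogg', '.mp4', '.avi', '.mov',
    '.jpg', '.png', '.gif', '.pdf', '.zip',
}

def looks_like_media(url):
    """Return True if the URL path ends in a known media extension."""
    # urlsplit separates the path from the query string and fragment,
    # so 'song.mp3?x=1' is still recognised.
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in MEDIA_EXTENSIONS
```

In the parser, a link would then be appended only when `looks_like_media(value)` is false.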

You could look at the 'Content-Type' header:

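import urllib2  # Python 2; in Python 3 this module became urllib.request

url = 'https://stackoverflow.com/questions/13431060/python-html-parsing'
req = urllib2.Request(url)
# Send a HEAD request so that only the headers are fetched, not the body.
req.get_method = lambda : 'HEAD'
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)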

yields

text/html; charset=utf-8

Many thanks to @JonClements for req.get_method = lambda : 'HEAD'. More info on this and alternate methods for sending a HEAD request can be found here.
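
On Python 3, where `urllib2` no longer exists, the same idea could be sketched roughly as follows. The helper names `is_html_content_type` and `fetch_content_type` are hypothetical, and treating `application/xhtml+xml` as HTML is an assumption rather than something stated above.

```python
import urllib.request

def is_html_content_type(header_value):
    """Return True if a Content-Type header value names an HTML media type."""
    if not header_value:
        return False
    # Drop parameters such as '; charset=utf-8' and normalise case.
    media_type = header_value.split(';', 1)[0].strip().lower()
    return media_type in ('text/html', 'application/xhtml+xml')

def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, or None."""
    # Request accepts a method argument on Python 3.3+, so there is
    # no need to override get_method.
    req = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.headers.get('Content-Type')
```

A link would be kept when `is_html_content_type(fetch_content_type(url))` is true. Some servers reject HEAD requests or report an inaccurate type, so callers may want to catch `urllib.error.URLError` and decide how to treat those links.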

unutbu
  • Instead of using `Range` - I'd probably go for `request = urllib2.Request(someurl); request.get_method = lambda : 'HEAD'; response = urllib2.urlopen(request)` and continue from there... – Jon Clements Nov 17 '12 at 14:43
  • @JonClements: Thank you very much for the info. I didn't know you could do that. – unutbu Nov 17 '12 at 14:45
  • @JonClements: What does it mean for `req.get_method()` to return `HEAD`? [The docs](http://docs.python.org/2/library/urllib2.html#urllib2.Request.get_method) seem to say it always returns `GET` or `POST`...? – unutbu Nov 17 '12 at 14:50
  • If a payload is present in the request, then `get_method` returns `POST`; otherwise it returns `GET`. Replacing the method is a very kludgy way of writing `requests.head(url)`... – Jon Clements Nov 17 '12 at 14:54