python html parsing

Question

I have the following problem:

I would like to parse HTML files and extract the links from them. I can get the links with the following code:

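from HTMLParser import HTMLParser  # Python 2; on Python 3: from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        self.url = url
        self.links = []  # per-instance list; a class-level list is shared by every parser

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == 'href' and value and value.startswith('http:'):
                self.links.append(value)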

But I don't want links to audio files, video files, etc. I only want links to HTML pages. How can I do that?

I can check each link's ending and skip the link if it has a particular format. Is there another way? — Ali Ismayilov, Nov 17 '12 at 14:02
http://stackoverflow.com/questions/717541/parsing-html-in-python?rq=1 — ppaulojr, Nov 17 '12 at 14:19
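
The extension check mentioned in the comment above could be sketched roughly as follows (Python 3). The helper name `looks_like_media` and the `MEDIA_EXTENSIONS` set are illustrative assumptions, not an exhaustive list. The check only looks at the URL path, so it cannot detect media served from URLs without an extension.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical, non-exhaustive set of extensions to skip
MEDIA_EXTENSIONS = {".mp3", ".wav", ".ogg", ".mp4", ".avi", ".mov",
                    ".jpg", ".png", ".gif", ".pdf", ".zip"}

def looks_like_media(url):
    """Return True if the URL path ends in one of the listed extensions."""
    path = urlsplit(url).path                  # drops query string and fragment
    ext = posixpath.splitext(path)[1].lower()  # e.g. ".mp3", or "" if none
    return ext in MEDIA_EXTENSIONS

# Example URLs (placeholders)
links = [
    "http://example.com/index.html",
    "http://example.com/song.MP3?download=1",
    "http://example.com/about",
]
html_candidates = [u for u in links if not looks_like_media(u)]
```

This needs no network access, but a link that passes the filter is only a candidate: the server may still return something other than HTML.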

score 3 · Accepted Answer · edited May 23 '17 at 12:27


I can check each link's ending and skip the link if it has a particular format. Is there another way?

You could look at the 'Content-Type' header:

import urllib2  # Python 2 only

url = 'https://stackoverflow.com/questions/13431060/python-html-parsing'
req = urllib2.Request(url)
# Override get_method so urllib2 sends HEAD; only the headers are returned
req.get_method = lambda : 'HEAD'
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)

yields

text/html; charset=utf-8

Many thanks to @JonClements for req.get_method = lambda : 'HEAD'. More info on this and alternate methods for sending a HEAD request can be found here.
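
For readers on Python 3, where `urllib2` no longer exists, the same idea might look like the sketch below. The function names `build_head_request`, `is_html` and `url_is_html` are made up for illustration, and treating `application/xhtml+xml` as HTML is an assumption of this sketch rather than part of the answer.

```python
import urllib.request

def build_head_request(url):
    """Build a request that asks only for headers (method= needs Python 3.3+)."""
    return urllib.request.Request(url, method="HEAD")

def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML page."""
    if not content_type:
        return False
    # "text/html; charset=utf-8" -> "text/html"
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")

def url_is_html(url, timeout=10):
    """Send a HEAD request and classify the response by its Content-Type."""
    request = build_head_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_html(response.headers.get("Content-Type"))
```

In the parser from the question, a link would then be appended only when `url_is_html(value)` returns true. This costs one HTTP round trip per link, and some servers reject HEAD requests or report a misleading type, so the network call probably wants its own error handling.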

answered Nov 17 '12 at 14:20 by unutbu (edited May 23 '17 at 12:27 by Community)

Instead of using `Range` - I'd probably go for `request = urllib2.Request(someurl); request.get_method = lambda : 'HEAD'; response = urllib2.urlopen(request)` and continue from there... – Jon Clements Nov 17 '12 at 14:43
@JonClements: Thank you very much for the info. I didn't know you could do that. – unutbu Nov 17 '12 at 14:45
@JonClements: What does it mean for `req.get_method()` to return `HEAD`? [The docs](http://docs.python.org/2/library/urllib2.html#urllib2.Request.get_method) seem to say it always returns `GET` or `POST`...? – unutbu Nov 17 '12 at 14:50
If a payload is present in the request, then `get_method` returns `POST`; otherwise it returns `GET`. Replacing the method is a very kludgy way of writing `requests.head(url)`... – Jon Clements Nov 17 '12 at 14:54
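
The default behaviour of `get_method` described in this comment can be checked without sending anything. The sketch below uses Python 3's `urllib.request`, the successor to `urllib2`, whose `Request` accepts a `method` argument (Python 3.3+), so overwriting `get_method` is no longer needed there. The URL is a placeholder.

```python
import urllib.request

url = "http://example.com/"  # placeholder; no request is actually sent

plain = urllib.request.Request(url)                         # no payload
with_payload = urllib.request.Request(url, data=b"key=value")
head = urllib.request.Request(url, method="HEAD")           # explicit method

methods = [plain.get_method(), with_payload.get_method(), head.get_method()]
print(methods)  # ['GET', 'POST', 'HEAD']
```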
