Download a URL only if it is an HTML Webpage

Question

I want to write a Python script that downloads a web page only if it contains HTML. I know the Content-Type header can be used for this, but I cannot find a way to get the headers before the file is downloaded. Please suggest a way to do it.

@NiklasB. I have explored the request object and tried the retrieve function, but it creates a file on the file system first and returns the email.mimetype object. I want to download the file only if the content is HTML — chinmayaposwalia, Mar 17 '12 at 13:58
Have a look at [this question](http://stackoverflow.com/questions/843392/python-get-http-headers-from-urllib-call) — Lev Levitsky, Mar 17 '12 at 14:12

score 2 · Accepted Answer · answered Mar 17 '12 at 14:16

Use http.client to send a HEAD request to the URL. This returns only the headers for the resource, so you can look at the Content-Type header and see if it is text/html. If it is, send a GET request to the URL to get the body.
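
A minimal sketch of that HEAD-then-GET approach, assuming Python 3 and plain http/https URLs. The helper names `is_html`, `_request` and `fetch_if_html` are made up for illustration, and the sketch does not follow redirects or retry with GET when a server rejects HEAD.

```python
import http.client
from urllib.parse import urlsplit


def is_html(content_type):
    """Return True if a Content-Type header value denotes text/html."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def _request(method, url, timeout=10):
    """Send one request (hypothetical helper); return (response, connection)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn.request(method, path)
    return conn.getresponse(), conn


def fetch_if_html(url):
    """Return the body as bytes if the URL serves text/html, else None."""
    # Step 1: HEAD returns the headers only, so no body is transferred.
    head, conn = _request("HEAD", url)
    try:
        if head.status != 200 or not is_html(head.getheader("Content-Type")):
            return None
    finally:
        conn.close()
    # Step 2: the resource reports itself as HTML, so fetch the body with GET.
    resp, conn = _request("GET", url)
    try:
        return resp.read() if resp.status == 200 else None
    finally:
        conn.close()
```

Calling `fetch_if_html("http://example.com/")` (a placeholder URL) would return the page bytes, or `None` when the server reports another content type or a non-200 status. The check trusts what the server declares in the header; it does not inspect the content itself.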

Lance Helsten
