python: how to fetch an url? (with improper response headers)

Question

I want to build a small script in Python that needs to fetch a URL. The server is kind of crappy, though, and replies with pure ASCII without any headers.

When I try:

import urllib.request
response = urllib.request.urlopen(url)
print(response.read())

I get an http.client.BadStatusLine: 100 error because this isn't a properly formatted HTTP response.

Is there another way to fetch a URL and get the raw content, without trying to parse the response?

Thanks

I'm using python 3, and urllib2 isn't installed by default there. I think it's for python2, but correct me if I'm wrong. To my understanding, the behavior would also be the same, as urllib2 also parses the response (feel free to correct me if I am mistaken). — dagnelies, Apr 11 '12 at 14:27
Looks like `urllib` in `python3.x` is the same as `urllib2` in `python2.x`. Have you tried making a `URLopener` object, then using one of its `open` methods (`open_data`, `open_file`, `open_ftp`, `open_http`, `open_https`; use `help(urllib)` to find out more)? While I don't have `python3.x` or access to the data you are testing against, the docs say nothing about headers on this, whereas the `request` method does, explicitly. The `requests` module is widely lauded, though, if it is useful for this. — theheadofabroom, Apr 11 '12 at 15:42
Disregard most of that last comment - I am mixing up content from `urllib` and `urllib2` - just check the docs for what you have - it's generally fairly clear — theheadofabroom, Apr 11 '12 at 15:47

score 1 · Answer 1 · edited May 23 '17 at 12:20

What you need to do in this case is send a raw HTTP request using sockets.
You would need to do a bit of low-level network programming using Python's socket module. (Network sockets return everything the server sends exactly as it is, so you can interpret the response however you wish. For example, the HTTP protocol expects a response to start with a status line followed by standard headers, and a request to start with a method such as GET, POST, or HEAD. The high-level module urllib parses this information for you and just returns the data, which is why it fails when the status line is malformed.)
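
To make "interpret the response as you wish" concrete, here is a minimal sketch of splitting raw socket bytes yourself. The helper name `split_raw_response` and the rule "anything that does not start with HTTP/ is treated as pure body" are my own assumptions for illustration, not something from a library:

```python
def split_raw_response(raw):
    """Split raw bytes read from a socket into (status_line, headers, body).

    Hypothetical helper: if the data does not look like HTTP at all,
    everything is returned untouched as the body.
    """
    if not raw.startswith(b"HTTP/"):
        # Header-less reply, like the server in the question: no parsing.
        return None, {}, raw

    # The header block ends at the first blank line; accept CRLF or bare LF
    # and use whichever separator occurs first in the data.
    found = [(raw.find(sep), sep) for sep in (b"\r\n\r\n", b"\n\n")]
    found = [(pos, sep) for pos, sep in found if pos != -1]
    if found:
        pos, sep = min(found)
        head, body = raw[:pos], raw[pos + len(sep):]
    else:
        head, body = raw, b""

    lines = [line.rstrip("\r") for line in head.decode("iso-8859-1").split("\n")]
    headers = {}
    for line in lines[1:]:
        if not line:
            continue
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body
```

With a broken server you would probably only ever hit the first branch, but the same function still copes with a well-formed reply.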

You also need some basic knowledge of HTTP requests. For your case, you just need to know about the GET request. See its definition here - http://djce.org.uk/dumprequest, and an example of it here - http://en.wikipedia.org/wiki/HTTP#Example_session. (If you wish to capture live traces of HTTP requests sent from your browser, you need packet-sniffing software such as Wireshark.)
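
As a rough illustration of what such a GET request looks like on the wire, here is a small sketch that builds one as bytes. The function name `build_get_request` and its parameters are hypothetical, and whether your server wants a Host header or any extra headers is something you would have to find out by experiment:

```python
def build_get_request(host, path="/", extra_headers=None, http_version="1.0"):
    """Return the bytes of a minimal GET request (hypothetical helper)."""
    lines = [
        "GET %s HTTP/%s" % (path or "/", http_version),  # request line
        "Host: %s" % host,                               # required by HTTP/1.1
    ]
    for name, value in (extra_headers or {}).items():
        lines.append("%s: %s" % (name, value))
    # Every line ends with CRLF; an empty line terminates the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


print(build_get_request("www.example.com", "/index.html").decode("ascii"))
```

The resulting bytes can be passed straight to `sock.sendall()` on a connected socket.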

Once you know the basics of the socket module and HTTP headers, you can go through this example - http://coding.debuntu.org/python-socket-simple-tcp-client, which shows how to send an HTTP request over a socket to a server and read its reply. You can also refer to this unclear question on SO.

(You can google python socket http to get more examples.)

(Tip: I am not a Java fan, but still, if you don't find enough convincing examples on this topic under python, try finding it under Java, and then accordingly translate it to python.)

Marty · Accepted Answer · 2012-04-12T16:50:30.957

It's difficult to answer your direct question without a bit more information, since we don't know exactly how the (web) server in question is broken.

That said, you might try using something a bit lower-level, a socket for example. Here's one way (python2.x style, and untested):

#!/usr/bin/env python
import socket
from urlparse import urlparse

def geturl(url, timeout=10, receive_buffer=4096):
    parsed = urlparse(url)
    try:
        host, port = parsed.netloc.split(':')
    except ValueError:
        host, port = parsed.netloc, 80

    sock = socket.create_connection((host, int(port)), timeout)
    try:
        # HTTP lines end with CRLF; request '/' when the URL has no path
        sock.sendall('GET %s HTTP/1.0\r\n\r\n' % (parsed.path or '/'))

        # keep reading until the server closes the connection
        response = [sock.recv(receive_buffer)]
        while response[-1]:
            response.append(sock.recv(receive_buffer))
    finally:
        sock.close()

    return ''.join(response)

print geturl('http://www.example.com/')

And here's a stab at a python3.2 conversion (you may not need to decode from bytes, if writing the response to a file for example):

#!/usr/bin/env python
import socket
from urllib.parse import urlparse

ENCODING = 'ascii'

def geturl(url, timeout=10, receive_buffer=4096):
    parsed = urlparse(url)
    try:
        host, port = parsed.netloc.split(':')
    except ValueError:
        host, port = parsed.netloc, 80

    # HTTP lines end with CRLF; request '/' when the URL has no path
    request = 'GET %s HTTP/1.0\r\n\r\n' % (parsed.path or '/')

    with socket.create_connection((host, int(port)), timeout) as sock:
        sock.sendall(request.encode(ENCODING))

        # keep reading until the server closes the connection
        response = [sock.recv(receive_buffer)]
        while response[-1]:
            response.append(sock.recv(receive_buffer))

    # join the raw chunks first, then decode once
    return b''.join(response).decode(ENCODING)

print(geturl('http://www.example.com/'))

HTH!

Edit: You may need to adjust what you put in the request, depending on the web server in question. Guanidene's excellent answer provides several resources to guide you on that path.
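
One adjustment that may matter: the geturl examples above send only the path, so a query string in the URL is dropped, and splitting the netloc on ':' is a bit fragile. A possible sketch using the attributes that urlparse already provides (the helper name `split_url` and the default port of 80 are my assumptions, and this is Python 3 only):

```python
from urllib.parse import urlparse


def split_url(url, default_port=80):
    """Return (host, port, request_target) for a URL (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.hostname                # host without port or credentials
    port = parsed.port or default_port    # parsed.port is None when absent
    target = parsed.path or "/"           # an empty path means the root
    if parsed.query:
        target += "?" + parsed.query      # keep the query string
    return host, port, target


print(split_url("http://example.com:8080/a/b?x=1&y=2"))
```

The returned target is what would go after GET in the request line, and host and port are what `socket.create_connection` expects.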

...this isn't working yet but definitely on the right track... thanks — dagnelies, Apr 12 '12 at 08:04
Great! Feel free to share what's not working if you think we can help. — Marty, Apr 12 '12 at 16:42
I just had to tweak the request header a little. The target server is quite a crazy beast. :) — dagnelies, Apr 13 '12 at 07:36

score 0 · Answer 3 · answered Apr 11 '12 at 14:25

urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')

user850498


also parses the response and results in a http.client.BadStatusLine: 100 – dagnelies Apr 11 '12 at 14:28
it's corporate stuff, sry (but i can see the output when pasting the url in firefox for instance) – dagnelies Apr 11 '12 at 14:32
`urlretrieve` should just put what the server sends into a file. Change `'abc.jpg'` to `'abc.txt'` – user850498 Apr 11 '12 at 14:40
Well, what should I say, it doesn't! :p ...it checks the response and results in a `urllib.error.URLError: ` – dagnelies Apr 11 '12 at 14:45
