Python: urlopen() versus CURL

Question

I'm writing a web crawler in Python and enjoying it a lot! But I've noticed some differences between the result produced by urlopen(url).read() in Python and by curl in the terminal. I tried to install the pycurl module, with no success. Is there a simple way to reproduce the cURL result in Python?

UPDATE

In this case I parsed this URL. I passed the same header on both requests: `User-Agent: Mozilla/5.0`. Here are the outputs:

cURL output: http://pastebin.com/PmmNhsbb
urlopen output: http://pastebin.com/7Wrt8pQZ

what differences - please post both outputs for a small page — amdixon, Nov 01 '15 at 03:58
**cURL:** http://pastebin.com/PmmNhsbb **Python:** http://pastebin.com/7Wrt8pQZ — bodruk, Nov 01 '15 at 04:11
I need to capture the elements with class `hproduct`. They are in the cURL version, but not in the Python urlopen version. — bodruk, Nov 01 '15 at 04:13
It seems like the server is sending different content for the two requests. I suggest you see what headers cURL is using and try duplicating those with urlopen. — augurar, Nov 01 '15 at 04:56
I'm passing the same header on both requests `User-Agent: Mozilla/5.0`. — bodruk, Nov 01 '15 at 12:55

score 1 · Accepted Answer · edited Oct 20 '16 at 02:46

I know this is an old question, but maybe the answer can still be useful.

I had the same problem. To solve it, I created a PHP file that printed the request headers, then made one request with curl and one with urlopen and compared the headers each of them sent. You can find an example of such a script in the PHP docs.
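If you would rather not set up PHP, the same idea can be sketched with Python's standard library alone. This is not the script I used, just a minimal stand-in assuming Python 3: a local server that echoes back whatever request headers it receives, plus a helper that calls it through urlopen. The names `HeaderEchoHandler`, `start_echo_server` and `headers_seen_by_server` are made up for this example.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen


class HeaderEchoHandler(BaseHTTPRequestHandler):
    """Reply to any GET with the request headers, encoded as JSON."""

    def do_GET(self):
        body = json.dumps(dict(self.headers.items()), indent=2).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request log line


def start_echo_server(host="127.0.0.1", port=0):
    """Start the echo server in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), HeaderEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def headers_seen_by_server(url, headers):
    """Fetch url with urlopen and return the headers the echo server reported."""
    with urlopen(Request(url, headers=headers), timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))
```

To use it, call `start_echo_server()`, build the URL from `server.server_address[1]`, and print `headers_seen_by_server(url, {"User-Agent": "Mozilla/5.0"})`. While the server is running you can point `curl -A "Mozilla/5.0"` at the same URL and compare the two JSON outputs line by line.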

In addition, you can open your browser's developer tools and check which headers are being sent. Doing this, I saw that urlopen sends `Connection: close` instead of `Connection: keep-alive`.

So finally I added the keep-alive header, and urlopen began to behave like curl. That was my specific problem; yours may be different depending on what the server requires, and you may need to add or remove some other header.
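For anyone trying this on Python 3, one caveat: as far as I can tell, `urllib.request` sets `Connection: close` itself on every request it sends, so passing a keep-alive header to `urlopen` may have no effect there. A rough sketch that sends the header through `http.client` instead is below. The function name `fetch_keep_alive` and its defaults are my own, it does not follow redirects, and it ignores any proxy settings.

```python
import http.client
from urllib.parse import urlsplit


def fetch_keep_alive(url, user_agent="Mozilla/5.0", timeout=10):
    """GET url with 'Connection: keep-alive'; return (status, body bytes)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)

    # Rebuild the request target (path plus optional query string).
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    headers = {
        "User-Agent": user_agent,
        "Connection": "keep-alive",  # the header that differed in my case
    }
    try:
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

Call it as `status, body = fetch_keep_alive(url)` and check whether the body now contains the elements that were missing. If the server cares about some other header, add it to the `headers` dict in the same way.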
