
I'm writing a web crawler in Python and enjoying it a lot! But I've noticed some differences between the result of `urlopen(url).read()` in Python and the result of curl in the terminal. I tried to install the pycurl module, with no success. Is there a simple way to get the same result as cURL in Python?

UPDATE

In this case I fetched this URL. I passed the same header on both requests: `User-Agent: Mozilla/5.0`. Here are the outputs:

bodruk
  • What differences? Please post both outputs for a small page. – amdixon Nov 01 '15 at 03:58
  • **cURL:** http://pastebin.com/PmmNhsbb **Python:** http://pastebin.com/7Wrt8pQZ – bodruk Nov 01 '15 at 04:11
  • I need to capture the elements with class `hproduct`. They are in the cURL version, but not in the Python urlopen version. – bodruk Nov 01 '15 at 04:13
  • It seems like the server is sending different content for the two requests. I suggest you see what headers cURL is using and try duplicating those with urlopen. – augurar Nov 01 '15 at 04:56
  • I'm passing the same header on both requests `User-Agent: Mozilla/5.0`. – bodruk Nov 01 '15 at 12:55

1 Answer


I know this is an old question, but maybe the answer can still be useful.

I had the same problem. To solve it, I created a PHP file that printed the request headers, then ran both curl and urlopen against it and compared the headers each one sent. You can find an example of such a script in the PHP docs.
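
If you would rather not set up PHP, the same header check can be sketched in Python alone. This is only an illustration, assuming Python 3: `HeaderEchoHandler` and `headers_sent_by_urllib` are made-up names, and the server runs on a throwaway local port. It starts a tiny server that echoes back whatever request headers it receives, then calls it through `urllib.request` with the same handlers `urlopen` uses (minus any proxy).

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class HeaderEchoHandler(BaseHTTPRequestHandler):
    """Answers every GET with the request headers it received, as JSON."""

    def do_GET(self):
        body = json.dumps(dict(self.headers.items())).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def headers_sent_by_urllib(extra_headers):
    """Return the headers urllib.request actually puts on the wire."""
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), HeaderEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # Default urlopen handlers, but ignoring proxies set in the environment.
        opener = build_opener(ProxyHandler({}))
        with opener.open(Request(url, headers=extra_headers), timeout=5) as resp:
            return json.loads(resp.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    sent = headers_sent_by_urllib({"User-Agent": "Mozilla/5.0"})
    for name, value in sorted(sent.items()):
        print(name + ": " + value)
```

To see the curl side, keep an echo server like this running on a fixed port and point `curl -v` at it; the two header lists can then be compared line by line.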

In addition, you can check in your browser which headers are being sent. Doing this, I saw that urlopen sends `Connection: close` instead of `keep-alive`.

So finally I added the keep-alive header, and urlopen began to behave like curl. That was my specific problem; yours may be different depending on what the server requires, and you may need to add or remove some other header.
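
For reference, here is a rough sketch of sending a chosen set of headers from Python 3. Note the swap: it uses `http.client` rather than `urlopen`, because as far as I can tell `urllib.request` in Python 3 sets `Connection: close` itself regardless of what you pass, whereas `http.client` sends the headers as given. `fetch_with_headers`, `CURL_LIKE_HEADERS`, and the example URL are illustrative names and values, not something from the question; `Accept: */*` is included because curl sends it by default.

```python
import http.client
from urllib.parse import urlsplit

# Illustrative header set: the question's User-Agent, curl's default Accept,
# and the keep-alive value discussed above.
CURL_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "*/*",
    "Connection": "keep-alive",
}


def fetch_with_headers(url, headers):
    """GET `url` over HTTP or HTTPS and return (status, body bytes)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # http.client adds Host and Accept-Encoding on its own;
        # everything in `headers` is sent unchanged.
        conn.request("GET", path, headers=headers)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


if __name__ == "__main__":
    # Placeholder URL; replace it with the page you are crawling.
    status, body = fetch_with_headers("http://example.com/", CURL_LIKE_HEADERS)
    print(status, len(body))
```

This sketch does not follow redirects or decompress responses, so it is only meant for checking whether a given header changes what the server returns.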

jkdev
Iván Rodríguez Torres