1

I have a scraper that queries different websites, some of which use Content-Encoding in varying ways. Since I'm trying to simulate an AJAX query and need to mimic Mozilla, I need full support for it. There are multiple HTTP libraries for Python, but none seems complete:

httplib seems pretty low level, more like an HTTP packet sniffer really.

urllib2 is some sort of elaborate hoax. There are a dozen handlers for various web client functions, but mandatory HTTP features like Content-Encoding apparently aren't covered.

mechanize: nice, if already somewhat overkill for my tasks, but it only supports Content-Encoding 'gzip'.

httplib2: sounded most promising, but actually fails on 'deflate' encoding, because of the disparity between raw deflate streams and zlib-wrapped streams.
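For reference, the 'deflate' mismatch itself can be handled in a few lines. This is only a minimal sketch, assuming the response body has already been read fully into memory; `decode_deflate` is a hypothetical helper name, not part of any of the libraries above:

```python
import zlib


def decode_deflate(body: bytes) -> bytes:
    """Decode a 'deflate' body, accepting zlib-wrapped and raw streams."""
    try:
        # Expected case: deflate data inside a zlib wrapper (header + checksum).
        return zlib.decompress(body)
    except zlib.error:
        # Some servers send a bare deflate stream instead. A negative wbits
        # value tells zlib not to expect the zlib header or trailer.
        return zlib.decompress(body, -zlib.MAX_WBITS)
```

Trying the zlib-wrapped form first and falling back to raw deflate is a design choice for this sketch; it relies on a raw stream failing the zlib header check, which is the usual outcome but not something the format guarantees.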

So are there any other options? I can't believe I'm expected to reimplement workarounds for the above libraries. And it's not a good idea to distribute patched versions alongside my application, because packagers might remove them again if the corresponding library is available as a separate distribution package.

I almost don't dare to say it, but the HTTP functions API in PHP is much nicer. And besides Content-Encoding:*, I might need multipart/form-data at some point too. So, is there a comprehensive third-party library for HTTP retrieval?

mario
  • Second question is a duplicate of http://stackoverflow.com/questions/680305/using-multipartposthandler-to-post-form-data-with-python – Metalshark Jul 11 '10 at 14:12
  • @Metalshark: that poster module seems cool+simple. bookmarked. thanks! – mario Jul 11 '10 at 14:29

2 Answers

1

I would consider either invoking a child process of cURL or using python bindings for libcurl.

From this description cURL seems to support gzip and deflate.
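A rough sketch of the child-process variant, assuming a `curl` binary is on the PATH. The function names `build_curl_command` and `fetch` are made up for illustration; `--compressed` is the curl option that requests a compressed response and decodes it transparently:

```python
import subprocess
from typing import List


def build_curl_command(url: str) -> List[str]:
    """Build the argument list for a curl call that decodes compressed bodies."""
    return [
        "curl",
        "--silent",       # no progress meter
        "--show-error",   # but still report errors on stderr
        "--fail",         # non-zero exit status on HTTP errors
        "--compressed",   # send Accept-Encoding and decode the response
        url,
    ]


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Run curl as a child process and return the decoded response body."""
    result = subprocess.run(
        build_curl_command(url),
        capture_output=True,
        check=True,       # raise CalledProcessError if curl exits non-zero
        timeout=timeout,
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.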

Peter Lyons
  • I prefer wget over curl for command-line work, and thus was a little reluctant because PycURL is also a non-standard extension. But it's probably the most mature and feature-complete solution around, so really the best choice. – mario Jul 14 '10 at 11:39
-1

Beautiful Soup might work. Just throwing it out there.

karlw