
I would like Scrapy not to URL-encode my requests. I see that scrapy.http.Request imports scrapy.utils.url, which imports w3lib.url, which contains the variable _ALWAYS_SAFE_BYTES. I just need to add a set of characters to _ALWAYS_SAFE_BYTES, but I am not sure how to do that from within my spider class.

scrapy.http.Request relevant line:

fp.update(canonicalize_url(request.url))

canonicalize_url is from scrapy.utils.url, relevant line in scrapy.utils.url:

path = safe_url_string(_unquotepath(path)) or '/'

safe_url_string() is from w3lib.url, relevant lines in w3lib.url:

_ALWAYS_SAFE_BYTES = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-')

within w3lib.url.safe_url_string():

_safe_chars = _ALWAYS_SAFE_BYTES + b'%' + _reserved + _unreserved_marks
return moves.urllib.parse.quote(s, _safe_chars)
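
To illustrate the mechanism in the lines quoted above: the second argument to quote is the set of characters left unencoded, which is why extending the safe set would change the result. The following is a minimal sketch using the standard library's urllib.parse.quote directly, not w3lib itself; the example path is hypothetical.

```python
from urllib.parse import quote

# Hypothetical path containing characters outside quote()'s default safe set.
path = "/items[0]"

# By default only "/" is treated as safe (besides letters, digits and "_.-~").
default_encoded = quote(path)

# Listing "[" and "]" in the safe set leaves them untouched.
brackets_kept = quote(path, safe="/[]")

print(default_encoded)  # /items%5B0%5D
print(brackets_kept)    # /items[0]
```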
flyingtriangle
  • I'm facing this problem because a web server accepts commas only when unencoded, but Scrapy translates them in links into %2C. – Seppo Enarvi Nov 19 '14 at 11:57
  • I needed to quickly work around the problem, so I added self._url = self._url.replace('%2C', ',') into Request._set_url(). Removing the safe_url_string(url) call from the same function didn't help. – Seppo Enarvi Nov 19 '14 at 19:15
  • Any solution? ... I need it – Umair Ayub Feb 24 '17 at 16:26

1 Answer

I wanted to avoid encoding [ and ], and this is what I did.

When a Request object is created, Scrapy applies URL encoding to its URL. To revert this, you can use a custom middleware and change the URL back to what you need.

You could use a Downloader Middleware like this:

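class MyCustomDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # request.url cannot be assigned directly, so write to the private
        # _url attribute to undo the encoding applied at Request creation.
        # The third argument limits each replacement to the first two matches.
        request._url = request.url.replace("%5B", "[", 2)
        request._url = request.url.replace("%5D", "]", 2)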

Don't forget to "activate" the middleware in settings.py like so:

DOWNLOADER_MIDDLEWARES = {
    'so.middlewares.MyCustomDownloaderMiddleware': 900,
}

My project is named so, and its folder contains a file middlewares.py. Adjust the dotted path to match your own project.
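
The same idea can be generalized to any set of characters. The sketch below is one possible variant, not part of the original answer: the names restore_literal_chars and RestoreLiteralCharsMiddleware are hypothetical, and it relies on the same assumption as the middleware above, namely that writing to the private request._url attribute is honored by the Scrapy version in use. Unlike the fixed replace calls, it matches upper- and lower-case hex digits and does not limit the number of replacements.

```python
import re


def restore_literal_chars(url, chars="[]"):
    """Replace the percent-encoded form of each character in `chars`
    with the literal character (hypothetical helper)."""
    for ch in chars:
        encoded = "%{:02X}".format(ord(ch))  # e.g. "[" -> "%5B"
        # IGNORECASE also catches lower-case hex such as "%5b".
        # A function is used as the replacement so that `ch` is inserted
        # literally, without backslash processing.
        url = re.sub(re.escape(encoded), lambda _m, ch=ch: ch, url,
                     flags=re.IGNORECASE)
    return url


class RestoreLiteralCharsMiddleware(object):
    """Downloader middleware sketch; assumes request._url is writable."""

    # Characters to leave unencoded; adjust as needed (e.g. "," for commas).
    chars = "[]"

    def process_request(self, request, spider):
        request._url = restore_literal_chars(request.url, self.chars)
        return None  # let Scrapy continue processing the request


print(restore_literal_chars("http://example.com/?a%5B0%5D=1&b%5b%5d=2"))
# http://example.com/?a[0]=1&b[]=2
```

It would be activated in settings.py the same way as the middleware above, with the dotted path changed to wherever the class lives.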

Credit goes to: Frank Martin

Umair Ayub