
I'm scraping an XML sitemap which contains special characters like é, which results in

ERROR: Spider error processing <GET [URL with '%C3%A9' instead of 'é']>

How do I get Scrapy to keep the original URL as is, i.e. with the special character in it?

Scrapy==1.3.3

Python==3.5.2 (I need to stick to these versions)

Update: As per https://stackoverflow.com/a/17082272/6170115, I was able to recover the URL with the correct character using urllib.parse.unquote:

Example usage:

>>> from urllib.parse import unquote
>>> unquote('ros%C3%A9')
'rosé'
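
The same decoding can be applied across all of the percent-encoded URLs pulled from the sitemap. A minimal sketch (the URLs below are hypothetical):

```python
from urllib.parse import unquote

# Hypothetical percent-encoded URLs, as they come back from the sitemap
locs = [
    'http://example.com/ros%C3%A9',
    'http://example.com/d%C3%A9j%C3%A0-vu',
]

# unquote reverses the percent-encoding, restoring the UTF-8 characters
decoded = [unquote(u) for u in locs]
print(decoded[0])  # http://example.com/rosé
print(decoded[1])  # http://example.com/déjà-vu
```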

I also tried writing my own Request subclass that skips safe_url_string, but I end up with:

UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)

Full traceback:

[scrapy.core.scraper] ERROR: Error downloading <GET [URL with characters like ù]>
Traceback (most recent call last):
  File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 61, in download_request
    return agent.download_request(request)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 260, in download_request
    agent = self._get_agent(request, timeout)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 241, in _get_agent
    scheme = _parse(request.url)[0]
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 37, in _parse
    return _parsed_url_args(parsed)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 19, in _parsed_url_args
    path = b(path)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 17, in <lambda>
    b = lambda s: to_bytes(s, encoding='ascii')
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/python.py", line 120, in to_bytes
    return text.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)
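
The bottom of that traceback can be reproduced without Scrapy: the downloader's webclient converts the URL path to bytes with the ascii codec, which cannot represent characters like ù. A minimal sketch (the path is hypothetical):

```python
# Minimal stdlib reproduction of the failure at the bottom of the traceback:
# the webclient effectively does to_bytes(path, encoding='ascii'), which
# raises for any non-ASCII character left in the (unquoted) URL path.
path = '/vin-br%C3%BBl%C3%A9-ù'.replace('%C3%BB', 'û').replace('%C3%A9', 'é')

try:
    path.encode('ascii')  # what Scrapy's webclient attempts internally
except UnicodeEncodeError as exc:
    print(exc)  # same error class as in the traceback above
```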

Any tips?

happyspace
    Please have a look at my [answer](https://stackoverflow.com/questions/42445087/force-python-scrapy-not-to-encode-url) to a similar problem. Maybe you can apply that technique to your use case. – Frank Martin Oct 20 '17 at 09:09
  • It turned out the real problem was here: https://stackoverflow.com/questions/47563095/json-url-sometimes-returns-a-null-response and the answer is here: https://stackoverflow.com/a/47564798/6170115 – happyspace Nov 30 '17 at 03:31

2 Answers


I don't think you can do that, as Scrapy applies safe_url_string from the w3lib library before storing the request's URL. You would somehow have to reverse that.
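
To illustrate: safe_url_string percent-encodes non-ASCII bytes so the URL is safe to send over the wire. The stdlib's urllib.parse.quote does roughly the same thing for the path (a sketch of the behavior, not w3lib's exact implementation):

```python
from urllib.parse import quote, unquote

url = 'http://example.com/rosé'  # hypothetical URL with a non-ASCII character

# Roughly what safe_url_string does: UTF-8-encode the text, then
# percent-encode anything unsafe, leaving the ':' and '/' delimiters alone.
encoded = quote(url, safe=':/')
print(encoded)  # http://example.com/ros%C3%A9

# Reversing it, as suggested above, is just unquoting again:
print(unquote(encoded))  # http://example.com/rosé
```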

Tomáš Linhart

You may use the 'r' prefix before your URL string to make it a raw string literal: url = r'name of that url'

Maksymilian Wojakowski