I have the following URL, which contains accented characters:

https://www.janes.com/...tamandaré... etc.

When I try to request the link, I get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 89: invalid continuation byte

This is my code:

import requests

def request_site(url):
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0)'})

if __name__ == '__main__':
    url = 'https://www.janes.com/article/87665/laad-2019-united-kingdom-s-sea-signs-mou-with-brazilian-siatt-for-tamandaré-class-corvette-torpedo-tubes'
    print(request_site(url))

The full error:

Traceback (most recent call last):
  File "D:/OneDrive/PhD/Web Crawler/playground.py", line 104, in <module>
    print(request_site(url))
  File "D:/OneDrive/PhD/Web Crawler/playground.py", line 73, in request_site
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0)'})
  File "C:\Python35\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python35\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 668, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 668, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 149, in resolve_redirects
    url = self.get_redirect_target(resp)
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 115, in get_redirect_target
    return to_native_string(location, 'utf8')
  File "C:\Python35\lib\site-packages\requests\_internal_utils.py", line 25, in to_native_string
    out = string.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 89: invalid continuation byte

I found many similar questions (like link), but none of them proposes a solution to this exact problem, and all of the previous solutions are for Python 2.

Minions
  • Possible duplicate of [How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"](https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte) – ggorlen Apr 19 '19 at 15:58
  • @ggorlen, plz check the question .. they are different – Minions Apr 19 '19 at 16:07

1 Answer

A quick percent-encoding of the URL is all that's needed. Drop the http:// prefix from the URL before quoting, because urllib.parse.quote would otherwise encode the colon in it as well:

import urllib.parse

import requests

def request_site(url):
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0)'})

if __name__ == '__main__':
    # URL without the scheme, so that quote() does not escape the ':' in 'http://'
    url = 'www.janes.com/article/87665/laad-2019-united-kingdom-s-sea-signs-mou-with-brazilian-siatt-for-tamandaré-class-corvette-torpedo-tubes'
    # Encode as latin-1 first, so 'é' becomes the single byte 0xe9 and is quoted as %E9
    url_encode = 'http://' + urllib.parse.quote(url.encode('latin-1'))
    print(request_site(url_encode))
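
A variant that keeps the scheme intact, as a minimal sketch: split the URL with urllib.parse.urlsplit and percent-encode only the path. The helper name quote_url_path and its encoding parameter are hypothetical and not part of requests. latin-1 is assumed here only because the 0xe9 byte in the traceback matches "é" in that encoding.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_url_path(url, encoding='latin-1'):
    """Percent-encode only the path of url; scheme, host, query and fragment are left as-is."""
    parts = urlsplit(url)
    # quote() accepts bytes and keeps '/' unescaped by default (safe='/')
    quoted_path = quote(parts.path.encode(encoding))
    return urlunsplit((parts.scheme, parts.netloc, quoted_path, parts.query, parts.fragment))

if __name__ == '__main__':
    url = 'https://www.janes.com/article/87665/laad-2019-united-kingdom-s-sea-signs-mou-with-brazilian-siatt-for-tamandaré-class-corvette-torpedo-tubes'
    print(quote_url_path(url))
```

The result can be passed straight to request_site, and the https:// prefix survives because only the path component is quoted.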
chitown88
  • I got an error with a URL from another site: https://velká-bíteš-takes-aim-at-indian-requirements-with-pbs-tj150-uav-engine, is there a more universal encoding that handles other kinds of characters? – Minions Apr 19 '19 at 16:59
  • Hmm. Not sure. I’ll have to look at this one. I’ll get back to you. – chitown88 Apr 19 '19 at 20:50
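
On the follow-up question: latin-1 covers only 256 code points, so a character such as "š" cannot be encoded with it, whereas UTF-8 can encode any Unicode character and is what urllib.parse.quote uses by default for str input. The snippet below is a rough sketch of that difference; the path string is a made-up illustration, and whether a particular server accepts the UTF-8 form is an assumption that has to be checked per site.

```python
from urllib.parse import quote

path = '/velká-bíteš'  # hypothetical path, for illustration only

try:
    path.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    # 'š' (U+0161) has no latin-1 representation
    latin1_ok = False

# With a str argument, quote() percent-encodes the UTF-8 bytes of each non-ASCII character
utf8_quoted = quote(path)

print(latin1_ok)
print(utf8_quoted)
```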