I have a link that contains a special character, ®, like the one below: https://www.google.com/something®something

When I try to open it, I get the error UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 68: ordinal not in range(128). I looked at other posts, but they only explain how to ignore special characters or how to handle them in the HTML body. I can't remove the special character because I need that exact URL to extract data. How can I open the URL correctly so that I can extract the data?

2 Answers

Try replacing the ® character with %C2%AE (its UTF-8 bytes, percent-encoded) and it should work.
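
To see where %C2%AE comes from, here is a minimal sketch using urllib.parse.quote from the standard library. The URL is the placeholder from the question, and the str.replace call is just one way to apply the substitution by hand.

```python
from urllib.parse import quote

# ® is U+00AE; its UTF-8 encoding is the two bytes C2 AE.
encoded = quote("®")
print(encoded)  # %C2%AE

# Apply the substitution by hand to the placeholder URL from the question.
url = "https://www.google.com/something®something"
fixed_url = url.replace("®", encoded)
print(fixed_url)
```

The resulting string is pure ASCII, so it should no longer trigger the 'ascii' codec error when passed to urlopen.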

Dwij Sheth

If you have multiple links with the same issue, maybe something like this?

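import urllib.parse
from urllib.request import urlopen

for link in new_links:  # new_links: your own list of URL strings
    url = list(urllib.parse.urlsplit(link))
    # url is now [scheme, netloc, path, query, fragment], e.g.:
    # [
    #     'https',
    #     'www.accessdata.fda.gov',
    #     '/scripts/drugshortages/dsp_ActiveIngredientDetails.cfm',
    #     'AI=AVYCAZ®%20(ceftazidime%20and%20avibactam)%....',
    #     ''
    # ]
    # Percent-encode the query string. '=', '&' and '%' are marked safe so
    # key=value pairs and existing escapes such as %20 are left untouched.
    url[3] = urllib.parse.quote(url[3], safe="=&%")
    url = urllib.parse.urlunsplit(url)

    html = urlopen(url)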

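The key is the quote function, which replaces special characters in the string with their '%xx' escapes (for a non-ASCII character like ®, the escapes of its UTF-8 bytes, i.e. %C2%AE). By default quote also escapes characters such as '=' and '%'; its safe argument lists the characters to leave alone. You'll probably have to adapt the url[3] = ... line depending on where the special characters appear in your links.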

Reference: https://stackoverflow.com/a/18269491/6601244
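
If the special character can appear in the path as well as the query string, the same idea can be wrapped in a small function. This is only a sketch: to_ascii_url is a hypothetical helper name, not part of urllib, and the choice of safe characters assumes that any '%' already in the URL belongs to a valid escape.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Percent-encode non-ASCII characters in the path and query of url.

    Hypothetical helper for illustration; not part of the standard library.
    """
    parts = urlsplit(url)
    # '%' is kept safe so existing escapes such as %20 are not double-encoded.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(to_ascii_url("https://www.google.com/something®something"))
# https://www.google.com/something%C2%AEsomething
```

The return value can be passed straight to urlopen. Note that this sketch leaves the host name alone, so a non-ASCII domain would need separate handling.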

phi