Beautifulsoup link(url) has a special character

Question

I have a link that contains a special character, ®, like the one below: https://www.google.com/something®something

I get the error UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 68: ordinal not in range(128). I looked at other posts, but they only explain how to ignore special characters or how to handle them in the HTML body. I can't remove the special character because I need that exact URL to extract data. How can I open this URL correctly so that I can extract the data?

Does this answer your question? [How to urlencode a querystring in Python?](https://stackoverflow.com/questions/5607551/how-to-urlencode-a-querystring-in-python) — sushanth, May 18 '20 at 05:27

score 0 · Accepted Answer · answered May 18 '20 at 05:24

Try replacing the ® character with its percent-encoded UTF-8 form, %C2%AE, and it should work.
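To see where %C2%AE comes from: ® is U+00AE, which UTF-8 encodes as the two bytes C2 AE, and urllib.parse.quote escapes each byte. A minimal sketch, using the placeholder URL from the question (the variable names are just for illustration):

```python
from urllib.parse import quote

# quote() encodes the text as UTF-8 by default, then percent-escapes each byte
encoded_char = quote("®")

# Manual replacement on the example URL from the question
raw_url = "https://www.google.com/something®something"
fixed_url = raw_url.replace("®", encoded_char)

print(encoded_char)
print(fixed_url)
```

The resulting string is pure ASCII, so it can be passed to urlopen without triggering the UnicodeEncodeError.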

Dwij Sheth

ah I see thank you! – AnotherCoder May 18 '20 at 05:39

score 0 · Answer 2 · answered May 18 '20 at 05:37

If you have multiple links with the same issue, you could percent-encode each one in a loop, maybe something like this:

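import urllib.parse
from urllib.request import urlopen

# new_links is your own list of URL strings
for link in new_links:
    # Split into [scheme, netloc, path, query, fragment]
    url = list(urllib.parse.urlsplit(link))
    '''
    url now looks like this:
    [
    'https',
    'www.accessdata.fda.gov',
    '/scripts/drugshortages/dsp_ActiveIngredientDetails.cfm',
    'AI=AVYCAZ®%20(ceftazidime%20and%20avibactam)%....',
    ''
    ]
    '''
    # Percent-encode the query; keep '=', '&' and existing '%xx' escapes as they are
    url[3] = urllib.parse.quote(url[3], safe='=&%')
    url = urllib.parse.urlunsplit(url)

    html = urlopen(url)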

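The key is the quote function, which replaces special characters in the string with their '%xx' codes. Here url[3] is the query string; you'll probably have to adapt the url[3] = ... line depending on your links, for example by using url[2] if the special character is in the path.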
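One way to handle the adaptation mentioned above is to encode both the path and the query in a small helper. This is only a sketch: encode_url is a hypothetical name, www.example.com is a placeholder host, and the choice of safe characters assumes your links may already contain escapes such as %20 that should not be encoded a second time.

```python
from urllib.parse import urlsplit, urlunsplit, quote

def encode_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL."""
    parts = urlsplit(url)
    # '%' stays safe so existing escapes such as %20 are not double-encoded
    path = quote(parts.path, safe="/%")
    # '=' and '&' stay safe so the key=value structure of the query survives
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

# Placeholder URL, loosely modelled on the query string shown above
example = "https://www.example.com/search?AI=AVYCAZ®%20(ceftazidime)"
print(encode_url(example))
```

The return value can then be passed to urlopen. Note that keeping '%' safe means a literal percent sign that is not part of an escape would be left unencoded, so this assumes the input is already partly encoded.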

Reference: https://stackoverflow.com/a/18269491/6601244
