how to get python to recognize the ® symbol

Question

Hi there, I am trying to make Python recognize ® as a symbol (in case it doesn't render well here, it is the 'registered' symbol: a capital R inside a circle). I understand that Python does not recognize it because of ASCII; however, I was wondering if anyone knows of a way to use a different encoding that includes this symbol, or a method to make Python 'ignore' it.

For some context: I am trying to make an auto-checkout program for a website, so my program needs to match the item that the user wants. To do this I am using BeautifulSoup to scrape information; however, the symbol '®' appears in the names of a few of the items, causing Python to crash. Here is the code I am currently using, which fails because of ASCII:

for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))

Any help would be appreciated

Here is the entirety of the program so far (ignore the mess; it is nowhere near done):

import time
import webbrowser
from selenium import webdriver
import mechanize
from bs4 import BeautifulSoup
import urllib2
from selenium.webdriver.support.ui import Select

CnI = []
item = []
colour = []
Uhrefs = []
Whrefs = []
FinalColours = []
selectItemindex = []
selectColourindex = []

#counters
Ccounter = 0
Icounter = 0
Splitcounter = 1

#wanted items suffix options: jackets, shirts, tops_sweaters, sweatshirts, pants, shorts, hats, bags, accessories, skate
suffix = 'accessories'
Wcolour = 'Black'
Witem = '2-Tone Nylon 6-Panel'

driver=webdriver.Chrome()
driver.get('http://www.supremenewyork.com/shop/all/'+suffix)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)

for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
    print(colour)


print('#############')

for each in CnI:
    each.split(',')
    print(each)







while Splitcounter<=len(CnI):
    item.append(CnI[Splitcounter-1])
    FinalColours.append(CnI[Splitcounter])
    Whrefs.append(Uhrefs[Splitcounter])
    Splitcounter+=2

print(Uhrefs)

for each in item:
    print(each)

for z in FinalColours:
    print(z)

for i in Whrefs:
    print(i)

##for i in item:
##    hold = item.index(i)
##    print(hold)
##    if Witem == i and Wcolour == FinalColours[i]:
##        print('correct')
##
##


for count,elem in enumerate(item):
    if Witem in elem:
        selectItemindex.append(count+1)


for count,elem in enumerate(FinalColours):
    if Wcolour in elem:
        selectColourindex.append(count+1)
print(selectColourindex)
print(selectItemindex)


for each in selectColourindex:
    if selectColourindex[Ccounter] in selectItemindex:
        point = selectColourindex[Ccounter]
        print(point)
    else:
        Ccounter+=1
web = 'http://www.supremenewyork.com'+Whrefs[point-1]
driver.get(web)





elem1 = driver.find_element_by_name('commit')
elem1.click()

time.sleep(1)

elem2 = driver.find_element_by_link_text('view/edit basket')
elem2.click()
time.sleep(1)

elem3 = driver.find_element_by_link_text('checkout now')
elem3.click()

If I copy the symbol in your post, I can work with it just fine in a string — bphi, May 09 '18 at 15:09
Please post the code that is "crashing" with the full stack trace. — BoarGules, May 09 '18 at 15:13
@bphi Yeah, I have to use 2.7 because of the add-ons I need for my program. I believe Python 3 onwards uses UTF-8. — Reix, May 09 '18 at 15:13
Your problem is considerably bigger than '®' since the contents returned from the web page you're scraping could easily be in an entirely non-ASCII-representable alphabet. You should design your code around unicode, not ASCII, item names. It is *never* safe to assume you're extracting ASCII from a web page (ask me how I know). — Larry Lustig, May 09 '18 at 15:22
Just remove `str()` everywhere and learn to use Unicode strings. — Mark Tolonen, May 09 '18 at 15:52

Srdjan Grubor · Accepted Answer · 2018-05-09T16:20:46.013

2

"®" is not an ASCII character; it is a non-ASCII Unicode code point (U+00AE), so under Python 2 your code will fail whenever str() is called on text that contains it. Instead of using str(), use something like this:

unicode(input_string, 'utf8')
# or
unicode(input_string, 'unicode-escape')

Edit: Given the code surrounding the initial snippet that was posted later and the fact that BeautifulSoup actually returns unicode already, it seems that removal of str() might be the best course of action and @MarkTolonen's answer is spot-on.
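
A minimal sketch of what dropping str() looks like, assuming the scraped values are already Unicode (as BeautifulSoup returns them). The `scraped` list, its item names, and its hrefs are hypothetical stand-ins for the `soup.find_all(...)` results, so the snippet runs without a browser or network. The `u''` literals keep it valid on both Python 2.7 and Python 3.

```python
# Hypothetical stand-in for (link.text, link.get('href')) pairs; with
# BeautifulSoup both values would already be Unicode strings.
scraped = [
    (u'Supreme\N{REGISTERED SIGN}/Hanes\N{REGISTERED SIGN} Socks', u'/shop/accessories/a1'),
    (u'2-Tone Nylon 6-Panel', u'/shop/hats/b2'),
]

names = []
hrefs = []
for text, href in scraped:
    names.append(text)   # no str(): keep the value as Unicode
    hrefs.append(href)

# Compare Unicode with Unicode when looking for the wanted item.
wanted = u'2-Tone Nylon 6-Panel'
matches = [i for i, name in enumerate(names) if wanted in name]
```

Because nothing is ever converted to a byte string, the names containing ® are stored and compared without any encoding step.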

edited May 09 '18 at 16:20

answered May 09 '18 at 15:13

1

`raw_input` is a builtin function in python2, could be confusing – avigil May 09 '18 at 15:22
@avigil Great point - fixed! – Srdjan Grubor May 09 '18 at 15:24
1

Your code *converts* a byte string to Unicode. The OP already has Unicode strings from BeautifulSoup. – Mark Tolonen May 09 '18 at 15:50

score 2 · Answer 2 · answered May 09 '18 at 15:46

BeautifulSoup returns Unicode strings. Stop converting them back to byte strings. Best practice when dealing with text is to:

1. Decode incoming text to Unicode (what BeautifulSoup is doing).
2. Process all text using Unicode.
3. Encode outgoing text from Unicode to bytes (to file, to database, to sockets, etc.).
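
The three steps above can be sketched roughly as follows. This is an illustration only: the sample bytes and the choice of UTF-8 as the incoming and outgoing encoding are assumptions, not something taken from the site being scraped.

```python
# Bytes as they might arrive from a file or socket (UTF-8 is assumed here).
raw = b'Supreme\xc2\xae Tee'

# Step 1: decode incoming bytes to Unicode at the boundary.
text = raw.decode('utf-8')

# Step 2: do all processing on Unicode strings.
has_mark = u'\N{REGISTERED SIGN}' in text
label = text.upper()

# Step 3: encode only when the text leaves the program.
outgoing = label.encode('utf-8')
```

When writing to a file, `io.open(path, 'w', encoding='utf-8')` performs step 3 for you on both Python 2.7 and 3.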

Small example of your issue:

text = u'\N{REGISTERED SIGN}'  # syntax to create a Unicode codepoint by name.
bytes = str(text)

Output:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    bytes = str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)

Note the first line works and supports the character. Converting it to a byte string fails because it defaults to encoding in ASCII. You can explicitly encode it with another encoding (e.g. bytes = text.encode('utf8')), but that breaks rule 2 above and creates other issues.
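
For completeness, here is a small sketch of what explicit encoding does with this character, including the 'ignore' behaviour the question asks about. The sample string is made up; the error-handler names ('ignore', 'replace') are the standard ones accepted by `encode`.

```python
text = u'Supreme\N{REGISTERED SIGN} Tee'

# Strict ASCII encoding is what str() attempts in Python 2, and it fails.
try:
    text.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

utf8_bytes = text.encode('utf-8')                # symbol kept, as two bytes
ascii_dropped = text.encode('ascii', 'ignore')   # symbol silently removed
ascii_marked = text.encode('ascii', 'replace')   # symbol replaced with '?'
```

Dropping or replacing characters loses information, so a name encoded this way may no longer match the name on the page; keeping the text as Unicode avoids that.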
