Hi there. I am trying to make Python recognize ® as a symbol (in case it doesn't render well here, it is the capital R within a circle, known as the 'registered' symbol). I understand that Python does not accept it because it is outside ASCII; however, I was wondering if anyone knows of a way to use a different encoding that includes this symbol, or a method to make Python 'ignore' it.

For some context: I am trying to make an auto-checkout program for a website, so my program needs to match the item that the user wants. To do this I am using BeautifulSoup to scrape information; however, the symbol '®' appears in the names of a few of the items, causing Python to crash. Here is the code I am currently using, which fails because of ASCII:

for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))

Any help would be appreciated.

Here is the entirety of the program so far (ignore the mess, it is nowhere near done):

import time
import webbrowser
from selenium import webdriver
import mechanize
from bs4 import BeautifulSoup
import urllib2
from selenium.webdriver.support.ui import Select

CnI = []
item = []
colour = []
Uhrefs = []
Whrefs = []
FinalColours = []
selectItemindex = []
selectColourindex = []

#counters
Ccounter = 0
Icounter = 0
Splitcounter = 1

#wanted items suffix options: jackets, shirts, tops_sweaters, sweatshirts, pants, shorts, hats, bags, accessories, skate
suffix = 'accessories'
Wcolour = 'Black'
Witem = '2-Tone Nylon 6-Panel'

driver=webdriver.Chrome()
driver.get('http://www.supremenewyork.com/shop/all/'+suffix)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)

for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
    print(colour)


print('#############')

for each in CnI:
    each.split(',')
    print(each)







while Splitcounter<=len(CnI):
    item.append(CnI[Splitcounter-1])
    FinalColours.append(CnI[Splitcounter])
    Whrefs.append(Uhrefs[Splitcounter])
    Splitcounter+=2

print(Uhrefs)

for each in item:
    print(each)

for z in FinalColours:
    print(z)

for i in Whrefs:
    print(i)

##for i in item:
##    hold = item.index(i)
##    print(hold)
##    if Witem == i and Wcolour == FinalColours[i]:
##        print('correct')
##
##


for count,elem in enumerate(item):
    if Witem in elem:
        selectItemindex.append(count+1)


for count,elem in enumerate(FinalColours):
    if Wcolour in elem:
        selectColourindex.append(count+1)
print(selectColourindex)
print(selectItemindex)


for each in selectColourindex:
    if selectColourindex[Ccounter] in selectItemindex:
        point = selectColourindex[Ccounter]
        print(point)
    else:
        Ccounter+=1
web = 'http://www.supremenewyork.com'+Whrefs[point-1]
driver.get(web)





elem1 = driver.find_element_by_name('commit')
elem1.click()

time.sleep(1)

elem2 = driver.find_element_by_link_text('view/edit basket')
elem2.click()
time.sleep(1)

elem3 = driver.find_element_by_link_text('checkout now')
elem3.click()
Reix

2 Answers

"®" is not an ASCII character; it is the Unicode code point U+00AE. In Python 2, str() tries to encode it with the default ASCII codec, so your code will fail whenever the symbol appears. Instead of using str(), use something like this:

unicode(input_string, 'utf8')
# or
unicode(input_string, 'unicode-escape')

Edit: Given the code surrounding the initial snippet that was posted later and the fact that BeautifulSoup actually returns unicode already, it seems that removal of str() might be the best course of action and @MarkTolonen's answer is spot-on.
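
To illustrate the point about dropping str(), here is a minimal sketch of the question's loop with the conversions removed. The inline HTML, the item name, and the list names `names` and `hrefs` are made up for illustration; in the real program the markup would come from driver.page_source.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Hypothetical markup standing in for driver.page_source.
html = (u'<a class="name-link" href="/shop/1">Supreme\u00ae/Hanes\u00ae Socks</a>'
        u'<a class="name-link" href="/shop/1">Black</a>')

soup = BeautifulSoup(html, 'html.parser')

names = []
hrefs = []
for link in soup.find_all('a', attrs={'class': 'name-link'}, href=True):
    # link.text is already a Unicode string, so no str() call is needed.
    names.append(link.text)
    hrefs.append(link.get('href'))
```

With the str() calls gone, nothing forces an ASCII encode, so the item name containing ® is stored like any other string.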

Srdjan Grubor

BeautifulSoup returns Unicode strings. Stop converting them back to byte strings. Best practice when dealing with text is to:

  1. Decode incoming text to Unicode (what BeautifulSoup is doing).
  2. Process all text using Unicode.
  3. Encode outgoing text from Unicode to bytes (to file, to database, to sockets, etc.).
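
The three steps above can be sketched roughly as follows. The byte string and the choice of UTF-8 are assumptions for illustration; the real encoding depends on where the data comes from and where it goes.

```python
# -*- coding: utf-8 -*-

# Hypothetical bytes as they might arrive from a socket or file (UTF-8 assumed).
raw = b'Supreme\xc2\xae/Hanes\xc2\xae Socks'

# 1. Decode incoming bytes to Unicode once, at the boundary.
text = raw.decode('utf-8')

# 2. Do all processing on the Unicode string.
wanted = u'Hanes'
is_match = wanted in text

# 3. Encode back to bytes only when the text leaves the program.
outgoing = text.encode('utf-8')
```

BeautifulSoup already performs step 1, so in the question's program only steps 2 and 3 are the caller's responsibility.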

Small example of your issue:

text = u'\N{REGISTERED SIGN}'  # syntax to create a Unicode codepoint by name.
bytes = str(text)

Output:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    bytes = str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)

Note the first line works and supports the character. Converting it to a byte string fails because str() defaults to encoding in ASCII. You can explicitly encode it with another encoding (e.g. bytes = text.encode('utf8')), but that breaks rule 2 above and creates other issues.
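
As a rough sketch of the encoding options mentioned in the note above, the snippet below shows the ASCII encode failing, an explicit UTF-8 encode succeeding, and the 'ignore' error handler dropping the symbol, which is the closest thing to making Python "ignore" it. The item name is made up for illustration.

```python
# -*- coding: utf-8 -*-

text = u'Supreme\N{REGISTERED SIGN} Socks'  # hypothetical item name

# Encoding to ASCII fails because U+00AE is outside the ASCII range.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# An explicit encoding that covers the character works.
utf8_bytes = text.encode('utf-8')

# The 'ignore' error handler silently drops characters ASCII cannot represent.
stripped = text.encode('ascii', 'ignore')
```

Dropping characters loses information and can make two different item names compare equal, so keeping the text as Unicode is usually the safer choice.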

Suggested reading:

Mark Tolonen