
I am scraping a page with Beautiful Soup, and the output contains non-standard Latin characters that are showing up as hex escapes.

I am scraping https://www.archchinese.com, which contains pinyin words that use non-standard Latin characters (ǎ and ā, for example). I've been looping through a series of links that contain pinyin, using BeautifulSoup's .string attribute along with UTF-8 encoding to output these words, but the output contains hex escapes in place of the non-standard characters: the word "hǎo" comes out as "h\xc7\x8eo". I'm sure I'm doing something wrong with the encoding, but I don't know enough to know what to fix. I tried decoding with UTF-8 first, but I get an error that the element has no decode function. Printing the string without encoding gives me an error about the characters being undefined, which, I figure, is because they need to be encoded to something first.
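For illustration (outside the scraper), this is the behavior I'm seeing in Python 3: strings can be encoded to bytes, but calling decode on a string fails, which matches the "no decode function" error.

>>> s = 'hǎo'
>>> s.encode('utf-8')   # encoding produces bytes, shown with hex escapes
b'h\xc7\x8eo'
>>> s.decode('utf-8')   # Python 3 str has no decode method
AttributeError: 'str' object has no attribute 'decode'

Here's the scraper code: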

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re

url = "https://www.archchinese.com/"

driver = webdriver.Chrome()  # Set Selenium up to open the page in Chrome.
driver.implicitly_wait(30)
driver.get(url)

driver.find_element_by_id('dictSearch').send_keys('好')  # This character is hǎo.

python_button = driver.find_element_by_id('dictSearchBtn')
python_button.click()  # Find the submit button and click it.

soup = BeautifulSoup(driver.page_source, 'lxml')

div = soup.find(id='charDef')  # Find the div with the target links.

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(a.string.encode('utf-8'))  # Loop through all pinyin links and attempt to encode.

Actual results: b'h\xc7\x8eo' b'h\xc3\xa0o'

Expected results: hǎo hào

EDIT: The problem seems to be related to a UnicodeEncodeError on Windows. I've tried installing win-unicode-console, but no luck. Thanks to snakecharmerb for the info.
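One workaround I've seen suggested (I haven't fully verified it on my setup) is to reconfigure stdout to UTF-8 on Python 3.7+ instead of relying on win-unicode-console:

import sys

# Python 3.7+: make stdout emit UTF-8 so a cp1252 console codec
# doesn't choke on characters like ǎ and à.
sys.stdout.reconfigure(encoding='utf-8')
print('hǎo')  # should print without a UnicodeEncodeError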

ep84

2 Answers


You don't need to encode the values when printing; print takes care of that automatically. By calling .encode('utf-8') you are printing the representation of the bytes that make up the encoded value rather than the string itself.

>>> s = 'hǎo'
>>> print(s)
hǎo

>>> print(s.encode('utf-8'))
b'h\xc7\x8eo'
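This also explains the "element has no decode function" error: a.string is already decoded text, and in Python 3 str has no .decode method. Only the bytes produced by .encode() can be decoded back:

>>> s.encode('utf-8').decode('utf-8')  # bytes -> str round trip
'hǎo'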
snakecharmerb
  • Tried print(a) with no encoding, and I get the same as if I printed a.string without encoding: Traceback (most recent call last): File "hanziscrape.py", line 22, in <module> print (a) File "C:\Users\root\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u01ce' in position 177: character maps to <undefined> – ep84 Dec 22 '18 at 19:17
  • Good old Windows. [This answer](https://stackoverflow.com/a/32176732/5320906) may help. – snakecharmerb Dec 22 '18 at 19:21
  • Yeah, I had already come across that thing and installed win-unicode-console through pip before. I tried again and got C:\Users\root>pip install win-unicode-console Requirement already satisfied: win-unicode-console in c:\users\root\appdata\local\programs\python\python37\lib\site-packages (0.5) – ep84 Dec 22 '18 at 19:26
  • I don't have a Windows box to hand so I can't really help further. But I'd recommend editing your question to make clear that your problem is the `UnicodeEncodeError` when printing to the Windows console, and the steps you've taken to try to address it. – snakecharmerb Dec 22 '18 at 19:31
  • Turns out I was using the Git console on Windows, and that was the factor. Your suggestion works perfectly. – ep84 Dec 22 '18 at 19:51

Encode the page source when you create the BeautifulSoup object, not after extracting the strings. BeautifulSoup will decode the UTF-8 bytes itself, so a.string comes back as ordinary text and can be printed directly.

soup = BeautifulSoup(driver.page_source.encode('utf-8'), 'lxml')

div = soup.find(id='charDef')  # Find the div with the target links.

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(a.string)
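If you want to confirm the values come back as text rather than bytes, a quick check (a sketch using the same soup object as above; bs4's NavigableString subclasses str):

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(isinstance(a.string, str), repr(a.string))  # expect: True 'hǎo' / True 'hào'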
nandu kk