
I started learning Python recently, and as a sort of challenge/project, I decided to try and create a "most common word finder."

To do this, I am using a website called Jisho, specifically its #kanji pages. (This is the page I am using to test my code.) From these pages, the finder will look at the on and kun reading compounds (which are in the `ul` elements with class `no-bullet`), and then find and print the most common English word among them.

For code help, this blog post is mainly what I am using. VS Code is my IDE.

I have currently imported urllib.parse, requests, and BeautifulSoup from bs4, and my code currently looks like this:

kanji = '人'
parsed_kanji = urllib.parse.quote(kanji)

url = f'https://jisho.org/search/{parsed_kanji}%20%23kanji'

page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

compounds = []
for li in soup.select('.no-bullet li'):
    comp = ' '.join(li.text.split())
    compounds.append(comp)
print(compounds)

(The code to find the most common word is not included.)
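
Since the counting step is omitted, here is a minimal sketch of one way it could work, not the asker's actual code. The function name `most_common_word`, the `STOP_WORDS` set and the sample strings are all hypothetical, and "English word" is assumed to mean a run of ASCII letters.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption); adjust it to taste.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or"}


def most_common_word(compounds):
    """Return the most frequent English word across the compound strings, or None."""
    counts = Counter()
    for comp in compounds:
        # Only runs of ASCII letters are treated as English words,
        # so kana and kanji in the same string are skipped.
        for word in re.findall(r"[a-z]+", comp.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up sample data in the general shape of a scraped compound line.
sample = [
    "\u4eba person, human being",
    "\u4eba\u985e mankind, humanity",
    "\u604b\u4eba lover, person one is in love with",
]
print(most_common_word(sample))
```

The result is plain ASCII, so printing it does not depend on the console encoding the way `print(compounds)` does.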

Everything works fine when print(compounds) is not there, but when it is included, I get the following error message:

Traceback (most recent call last):
    File "c:\Users\Lugnut\OneDrive\Desktop\frequent\most_common\test_list.py", line 22, in <module>
        print(compounds)
    File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u4eba' in position 2: character maps to <undefined>

Why is it that the print() function causes my code to break?

S.B
Lugnut
  • It means that your console is using an encoding that can't handle a character that came from the web site. – Mark Ransom Aug 30 '22 at 21:06
  • Does it work if you print each element of compounds separately instead of printing the entire list? I don't have a complete answer yet, but some searching on the error showed it may be related to the codeset your terminal is set to use. And apparently the UnicodeEncodeError can sometimes actually happen while it's decoding. – nigh_anxiety Aug 30 '22 at 21:08
  • What do you get if you do `print(sys.stdout.encoding)`? – Mark Ransom Aug 30 '22 at 21:10
  • @MarkRansom If the console is not able to handle that character, wouldn't it show question marks or other nonsense characters instead? – S.B Aug 30 '22 at 21:13
  • @S.B No - because the output never even gets to the terminal. OP didn't specify what version of Python they are running. In Python 2, the default encoding is ASCII, and all `str` are bytes; Unicode code points were stored in a `unicode` type. In Python 3, `str` is now Unicode code points, and encoded strings are now stored in type `bytes`, with a default encoding of UTF-8. The default error handling for both the `str.encode()` and `bytes.decode()` methods, which are happening behind the scenes here, is `'strict'`, which means they raise an exception whenever a character cannot be converted. – nigh_anxiety Aug 30 '22 at 22:23
  • @nigh_anxiety I just looked more closely at the error message, and it's using Python 3.9 with `cp1252`. Now every version of Python on Windows from 3.6 on bypasses `str.encode` to write Unicode directly to the console. So I have to assume they've redirected output to a file instead. – Mark Ransom Aug 31 '22 at 00:00
  • @nigh_anxiety I am not sure exactly what you mean by printing each element of compounds separately, but I tried printing compounds by doing `print(compounds[0]) print(compounds[1]) print(compounds[2])` ... This resulted in the same UnicodeEncodeError, except in `position 0` instead of `position 2`. I also tried unpacking with the asterisk operator, list comprehension, for loop, for loop w/ enumerate, the join() function, slicing syntax, and list.append. For all of these alternate methods, the UnicodeEncodeError was in `position 0`, except for slicing syntax and list.append, where it was in `position 2`. – Lugnut Aug 31 '22 at 13:20
  • @MarkRansom As for your other comment, unless there is a way to redirect output to a file unknowingly, I have not done this -- unless `page.text` somehow messes this up? – Lugnut Aug 31 '22 at 13:29
  • @lugnut - that's basically what I mean by printing separately (or doing a `for compound in compounds: print(compound)` loop). What terminal are you using? PowerShell vs bash vs Python terminal vs others? – nigh_anxiety Aug 31 '22 at 16:30
  • Based on this SO post, the problem may be that the font you have selected for your terminal only supports the Win-1252 page set, so you need to change fonts or set up a fallback font for additional characters. Or use the `chcp` command to change to a codeset that supports UTF-8: https://stackoverflow.com/questions/57612504/how-to-correctly-display-unicode-characters-in-vs-codes-integrated-terminal – nigh_anxiety Aug 31 '22 at 16:47
  • @nigh_anxiety fonts and encodings are two different things. Don't conflate them. `str.encode` only cares about the encoding, not the font. – Mark Ransom Aug 31 '22 at 17:04
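
The behaviour discussed in the comments can be reproduced without a console at all. The sketch below only illustrates how `str.encode` treats the character from the traceback under different error handlers; it assumes nothing about the asker's setup beyond the `cp1252` codec named in the error message.

```python
text = "\u4eba"  # the character from the traceback

# With the default 'strict' handler, a character that has no cp1252
# mapping raises instead of being turned into a question mark.
try:
    text.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Other handlers substitute something instead of raising.
replaced = text.encode("cp1252", errors="replace")          # b'?'
escaped = text.encode("cp1252", errors="backslashreplace")  # b'\\u4eba'

# UTF-8 can represent the character, so no handler is needed.
utf8 = text.encode("utf-8")                                 # b'\xe4\xba\xba'

# print(a_list) writes the list's repr, which starts with "['", so the
# first character of the first element sits at index 2 of the output.
position_in_list_repr = repr([text]).index(text)            # 2

print(strict_failed, replaced, escaped, utf8, position_in_list_repr)
```

This would also account for the `position 2` versus `position 0` difference reported above: printing the whole list puts `['` in front of the first string, while printing a single element does not.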

1 Answer


Originally, by using sys.stdout.reconfigure(encoding='utf-8') in the file, I was able to get rid of the UnicodeEncodeError.
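
As a rough sketch of that first workaround (the placement at the top of the script and the `hasattr` guard are assumptions, not part of the original code):

```python
import sys

# reconfigure() is available on text streams since Python 3.7. The guard
# covers the case where sys.stdout has been replaced by an object that
# does not provide it.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

# From here on, print() encodes its output as UTF-8 instead of cp1252.
compounds = ["\u4eba"]  # stand-in for the scraped list
print(compounds)
```

The call has to run before the first `print()` that contains non-cp1252 characters, and it only affects the current process, which is why it has to be repeated in every script.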

But by setting the system locale to UTF-8 through this answer, and by setting my font to the TrueType MS Mincho and switching my console window's code page to 65001 in VS Code (via chcp 65001) through this answer, I was able to solve the UnicodeEncodeError more permanently, so I no longer have to use sys.stdout.reconfigure(encoding='utf-8') every time I want to run this kind of code.
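
To check whether such settings have taken effect, a small diagnostic along these lines can be run in the same terminal. The helper name `describe_text_encodings` is hypothetical, and the values it reports depend entirely on the machine it runs on.

```python
import locale
import sys


def describe_text_encodings():
    """Collect the encodings Python will use for console and file text."""
    return {
        # Encoding print() uses; may be None if stdout was replaced.
        "stdout": sys.stdout.encoding,
        # Locale-based default used by open() when no encoding is given.
        "preferred": locale.getpreferredencoding(False),
        # True when Python runs in UTF-8 mode (Python 3.7+).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


for name, value in describe_text_encodings().items():
    print(f"{name}: {value}")
```

If `stdout` still reports `cp1252` after the changes, the UnicodeEncodeError is likely to come back for characters outside that code page.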

Lugnut