Before someone says this is a duplicate question, I just want to let you know that the error I am getting from running this program in command line is different from all the other related questions I've seen.

I am trying to run a very short script in Python:

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://dictionary.reference.com/browse/word?s=t").read().strip()
dhtml = str(html, "utf-8").strip()
soup = BeautifulSoup(dhtml.strip(), "html.parser")
print(soup.prettify())

But I keep getting an error when I run this program with python.exe: UnicodeEncodeError: 'charmap' codec can't encode character '\u025c'. I have tried a lot of methods to get around this, and I managed to isolate the problem to converting bytes to strings. When I run this program in IDLE, I get the HTML as expected. What is it that IDLE is automatically doing? Can I use IDLE's interpretation program instead of python.exe? Thanks!

EDIT:

The error is raised by print(soup.prettify()), yet type(soup.prettify()) returns str. Why does printing a str fail?

RESOLVED:

I finally made a decision to use encode() and decode() because of the trouble that has been caused. If someone knows how to actually resolve a question, please do; also, thank you for all your answers

rassa45
  • Not seeing a character encoding declared on that page. – stark Jul 18 '15 at 18:51
  • Do a ctrl+f for charset please – rassa45 Jul 18 '15 at 18:53
  • I think it is the first meta tag in `head` – rassa45 Jul 18 '15 at 18:53
  • You can also find out the encoding from BeautifulSoup – rassa45 Jul 18 '15 at 18:54
  • Sorry-you're right. Anyway, html validator shows 77 errors. https://validator.w3.org/nu/?doc=http%3A%2F%2Fdictionary.reference.com%2Fbrowse%2Fword%3Fs%3Dt – stark Jul 18 '15 at 18:58
  • Sir, my problem is not that I cannot find the code I need. My problem is that I have to add the method `.encode("utf-8")` to each element of BeautifulSoup I try to print out. Apologies for the confusion. I am simply trying to extract the definitions by identifying the relevant divs (which I am able to do) and print them out (which I need an easier solution to because I cannot do `.encode("utf-8")` for a lot of elements. I prefer to just print them out using the given method). – rassa45 Jul 19 '15 at 00:58
  • Also, the html validator mainly gives errors on unescaped links and the head section; both of which are not relevant to my project. Thank you for your help, however. – rassa45 Jul 19 '15 at 00:59
  • don't put your answer into the question, [post it as an answer instead](http://stackoverflow.com/help/self-answer) so that you can [accept it](http://stackoverflow.com/help/accepted-answer). – jfs Jul 22 '15 at 21:08

3 Answers

UnicodeEncodeError: 'charmap' codec can't encode character '\u025c'

The console's character encoding can't represent '\u025c', i.e., the Unicode character "ɜ" (U+025C LATIN SMALL LETTER REVERSED OPEN E).
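
The failure can be reproduced without a console by encoding the character directly. This is a minimal sketch: cp437 is only an assumed stand-in for a legacy Windows console code page (the actual code page depends on the system), and `try_encode` is a hypothetical helper written for this illustration.

```python
# Assumption: cp437 stands in for a legacy Windows console code page.
ch = "\u025c"  # LATIN SMALL LETTER REVERSED OPEN E


def try_encode(text, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


# A legacy code page has no slot for U+025C, so this yields the error text.
print(try_encode(ch, "cp437"))

# UTF-8 can encode every Unicode character, so this yields bytes.
print(try_encode(ch, "utf-8"))
```

print() performs the same encoding step using the encoding of sys.stdout, which is why the value being a str does not prevent the error.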

What is it that IDLE is automatically doing?

IDLE displays Unicode directly (BMP characters only), provided the font in use supports the characters.

Can I use IDLE's interpretation program instead of python.exe?

Yes, run:

T:\> py -midlelib -r your_script.py

Note: you could write arbitrary Unicode characters to the Windows console if the Unicode API is used:

T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py

See What's the deal with Python 3.4, Unicode, different languages and Windows?

jfs
  • I'd change "display arbitrary Unicode characters in Windows console" to something like "write Unicode to the console". Available fonts depend on the Windows locale, and the console window doesn't support mixing halfwidth characters with fullwidth characters (CJK), i.e. a character can't map to 2 cells. It's also limited to the BMP because each cell stores a single `wchar_t` code, which excludes using UTF-16 surrogate pairs. – Eryk Sun Jul 19 '15 at 16:13
  • However, those limits are due to how conhost.exe works, not the console API itself. You can actually hide the window that conhost.exe creates and instead display the console screen buffer in a window that has more flexible font support. That's what [ConEmu](http://conemu.github.io) does. – Eryk Sun Jul 19 '15 at 16:18
  • @eryksun: yes, astral characters are displayed as boxes even if the font supports the characters. If you copy the boxes and paste into e.g., notepad then the characters are shown correctly. – jfs Jul 19 '15 at 18:39
  • One more question, I have both Python 3 and Python27, how do I access the Python 3 program? – rassa45 Jul 20 '15 at 15:05
  • Configure `py` to start Python 3 by default (unless you've changed it; it is probably the default already). Or [specify the shebang](http://stackoverflow.com/a/14188640/4279) or call `py -3` explicitly. – jfs Jul 20 '15 at 15:20
  • I "specified the shebang" but I am looking to expand this program and test it on a localhost server. However, I doubt this will work if the program opens the IDLE shell when it runs. Is there a way to simply interpret and run the program and get the output without opening IDLE? – rassa45 Jul 20 '15 at 17:33
  • Also, your unicode solution works in command line but not from Github Atom – rassa45 Jul 20 '15 at 17:38
  • @ytpillai: 0. shebang is interpreted by `py` launcher (Windows) or [by the kernel on POSIX systems (`exec()` call)](http://stackoverflow.com/q/3009192/4279) 1. If you don't want to start IDLE, do not start it. 2. `py -mrun` expects that the script is run in Windows console without redirecting stdout. If stdout is redirected then use `PYTHONIOENCODING` instead -- [click the last link in the answer](http://stackoverflow.com/a/30551552/4279). – jfs Jul 20 '15 at 18:31
  • @ytpillai: If you can't run your script within Github Atom then [ask a separate question](http://stackoverflow.com/questions/ask) (the answer is probably to configure `PYTHONIOENCODING` envvar to use character encoding that the editor expects) – jfs Jul 20 '15 at 18:46

I just want to let you know that the error I am getting from running this program in command line is different from all the other related questions I've seen.

Not really. You have PrintFails like everyone else.

The Windows console can't print Unicode. (This isn't strictly true, but going into exactly why, when and how you can get Unicode out of the console is a painful exercise and not usually worth it.) Trying to print a character that isn't in the console's limited encoding can't work, so Python gives you an error.

print them out (which I need an easier solution to because I cannot do .encode("utf-8") for a lot of elements

You could run the command set PYTHONIOENCODING=utf-8 before running the script to tell Python to use an encoding which can include any character (so no errors), but any non-ASCII output will still come out garbled as its encoding won't match the console's actual code page.

(Or indeed just use IDLE.)
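
If the goal is only to avoid calling .encode("utf-8") on every element, another option is to escape unencodable characters in one place before printing. The following is a rough sketch, not a fix for the console itself: `safe_print` is a hypothetical helper, and the cp437 stream in the demo is an assumed stand-in for a console with a legacy code page.

```python
import io
import sys


def safe_print(text, stream=None):
    """Print text, escaping characters the stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the stream's encoding: unencodable characters
    # become backslash escapes such as \u025c instead of raising.
    cleaned = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(cleaned, file=stream)


# Demo on an in-memory stream with a legacy encoding (assumed: cp437).
demo = io.TextIOWrapper(io.BytesIO(), encoding="cp437")
safe_print("w\u025crd", stream=demo)
demo.flush()
print(demo.buffer.getvalue())
```

The escaped characters are readable but not displayed as glyphs, so this trades correct rendering for an error-free run. On Python 3.7 and later, sys.stdout.reconfigure(errors="backslashreplace") should have a similar effect for every print call, assuming sys.stdout is a regular text stream.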

bobince

I finally made a decision to use encode() and decode() because of the trouble that has been caused. If someone knows how to actually resolve a question, please do; also, thank you for all your answers

rassa45