I have a Python script that reads a file given by the user, processes it, and asks questions in flash card format. The program works fine with an English txt file, but I encounter errors when trying to process a French file.

When I first encountered the error, I was using the Windows command prompt and running python cards.py. When I input the French file, I immediately got a UnicodeEncodeError. After digging around, I found that it may have something to do with the fact that I was using the cmd window, so I tried IDLE. I didn't get any errors, but I would get weird characters like œ and à and ®.

Upon further research, I found documentation that says to pass encoding='insert encoding type' in the open(file) call in my code. After running the program again in IDLE, the problem seemed reduced, but I would still get some weird characters. When running it in cmd, it wouldn't break IMMEDIATELY, but it eventually would when it encountered an unknown character.

My question: what do I implement to ensure the program can handle ALL of the characters in the file (in any language), and why do IDLE and the command prompt handle the file differently?

EDIT: I forgot to mention that I ended up using utf-8 which gave the results I described.

Dave
  • Possible duplicate of [Setting the correct encoding when piping stdout in Python](http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python) – thodnev Aug 20 '16 at 00:37
  • Note that you can issue the command `chcp 65001` in the command prompt to switch to a Unicode (UTF-8) code page. – Basic Aug 20 '16 at 01:04
  • You didn't mention whether you are using Python 2 or 3. There are big differences between the two when it comes to Unicode. In short, you may find it easier to use 3. – fzzylogic Aug 20 '16 at 01:12
  • @fzzylogic I didn't directly say, correct but I included the python-3.x tag. Thanks – Dave Aug 20 '16 at 01:17
  • @Basic `chcp 65001` should only be considered a quick fix. It does not fully support utf-8 and does not allow Python to receive multibyte characters properly – Alastair McCormack Aug 21 '16 at 10:15

2 Answers

This is a common question. It seems you're running the script in cmd, whose default code page doesn't cover all of Unicode, so the error occurs when Python translates the output into the encoding that cmd is using. Because Unicode has a far larger character set than that code page, any character with no equivalent in it raises a UnicodeEncodeError.
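A minimal sketch of both symptoms from the question, using only the standard library. The code pages `cp437` and `cp1252` are assumptions standing in for whatever your console and locale actually use; they are not taken from the original post.

```python
# Illustrative only: cp437 and cp1252 are assumed stand-ins for the
# console code page and the Windows locale encoding.
text = "cœur"

# Encoding to a code page that lacks a character fails, which is what
# happens when print() sends such text to a cmd window.
try:
    text.encode("cp437")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Decoding UTF-8 bytes with the wrong codec raises no error but yields
# "weird characters" (mojibake) instead.
garbled = "é".encode("utf-8").decode("cp1252")

# ascii() keeps this line printable on any console.
print(encode_failed, ascii(garbled))
```

The first half corresponds to the crash in cmd, the second to the garbled text seen in IDLE when the file is decoded with the wrong encoding.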

IDLE is built on top of tkinter's Text widget, which supports Python's Unicode strings well.

Finally, when you open a file without specifying an encoding, the open function assumes it is in the platform default (as reported by locale.getpreferredencoding()). If your file uses a different encoding, state it explicitly with the encoding keyword argument to open.
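As a rough sketch of that advice, the helper below tries a short list of candidate encodings instead of relying on the platform default. The function name `read_text`, the candidate list, and the file name `cards_fr.txt` are hypothetical choices for illustration; adjust the list to the encodings you actually expect.

```python
import locale
import os
import tempfile


def read_text(path, encodings=("utf-8-sig", "cp1252")):
    """Return the file's text, trying each candidate encoding in order.

    The candidate list is an assumption for this sketch, not a rule.
    "utf-8-sig" also strips a leading byte order mark if one is present.
    """
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


# What open() would assume if no encoding were given.
default_encoding = locale.getpreferredencoding(False)

# Small demonstration with a temporary UTF-8 file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cards_fr.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("déjà vu")
    loaded = read_text(path)
```

Trying encodings in order is only a heuristic: a single-byte codec such as cp1252 accepts almost any byte sequence, so it should come last, and knowing the real encoding is always more reliable than guessing.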

Mark Tolonen
thodnev
  • I don't suppose there's a way to be able to process any type of encoding? Or do you have to specify each one? – Dave Aug 20 '16 at 00:34
  • @David that's how this New World Order of Unicode looks. If the real encoding is not known, it can only be guessed; a prime example is the `chardet` lib. But some encodings have markers, and then you can tell exactly which one it is – thodnev Aug 20 '16 at 00:40
  • @fzzylogic according to the docs, `open` relies on the platform's default encoding. In most cases it's utf-8, but not in all. – thodnev Aug 20 '16 at 00:42
  • @thodnev Tx for the correction. David, if it's possible in your case, you may find it easier to standardise with utf-8. – fzzylogic Aug 20 '16 at 00:52
  • @thodnev See my edit. I ended up using utf-8. I guess I won't be able to fully decode the file unless given the encoding type. As I said in the original post, even using utf-8 I still get weird characters – Dave Aug 20 '16 at 00:58
  • @David it seems you're done with the first part; the second is connected with handling Unicode in the Windows console. You may use a different console to run Python in (like PowerShell) or do some hacks with the standard cmd.exe (you could run your py script from a .bat with boilerplate code, like `@echo off` and `chcp 65001`). It may even be possible to run it with `os.system` without messing up the stdout with `chcp` output. Hard to tell, as I don't have a Windows machine nearby – thodnev Aug 20 '16 at 20:57
The Windows console does not natively support Unicode (despite what people say about chcp 65001). It's designed to be backwards compatible, so it only supports 8-bit character sets.

Use win-unicode-console instead. It talks to cmd at a lower level, which allows all Unicode characters to be printed and, importantly, entered.

The best way to enable it is in your usercustomize script, so that it's enabled by default on your machine.
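A possible usercustomize.py might look like the sketch below. It assumes the package is installed (for example with pip) and exposes `win_unicode_console.enable()`, as I understand its README to describe; the wrapper function `enable_unicode_console` and the guards around it are my own additions, so treat this as a starting point and not as the package's official recipe.

```python
# usercustomize.py (sketch)
import sys


def enable_unicode_console():
    """Try to switch the console streams to win-unicode-console.

    Returns True if the package was enabled and False otherwise, so
    Python still starts normally when the package is missing.
    """
    if sys.platform != "win32":
        return False  # only relevant for the Windows console
    try:
        import win_unicode_console  # third-party; assumed to be installed
    except ImportError:
        return False
    win_unicode_console.enable()
    return True


enabled = enable_unicode_console()
```

Guarding the import keeps the same file harmless on machines or virtual environments where the package is not available.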

Alastair McCormack