I am becoming more and more convinced that the business of file encodings is made as confusing as possible on purpose. I have a problem reading a UTF-8 encoded file that contains just one line:
“blabla this is some text”
(note that the quotation marks are fancy versions of the standard quotation marks).
Now, I run this piece of Python code on it:
import fileinput

def charinput(paths):
    with open(paths) as fi:
        for line in fi:
            for char in line:
                yield char

i = charinput('path/to/file.txt')
for item in i:
    print(item)
with two results: if I run the Python code from the command prompt, the result is some strange characters, followed by an error message:
ď
»
ż
â
Traceback (most recent call last):
File "krneki.py", line 11, in <module>
print(item)
File "C:\Python34\lib\encodings\cp852.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position
0: character maps to <undefined>
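For what it's worth, those garbage characters line up with a plausible explanation: the file was saved as UTF-8 with a byte order mark (the bytes EF BB BF), open() decoded it with the Windows locale default codec (cp1250 here is an assumption, inferred from the characters shown), and print() then failed when the cp852 console codec could not render the resulting '€'. A sketch of that chain:

```python
# Sketch: what happens when UTF-8 bytes (with a BOM) are decoded as cp1250.
# The BOM U+FEFF encodes to the bytes EF BB BF in UTF-8.
raw = '\ufeff“blabla'.encode('utf-8')

# Decoding those bytes as Windows-1250 (a common locale default on
# Central European Windows) yields mojibake instead of the original text.
mangled = raw.decode('cp1250')
print(mangled)  # ď»żâ€śblabla

# The '€' (U+20AC) inside 'â€ś' is what the cp852 console codec cannot
# encode, which matches the UnicodeEncodeError in the traceback.
try:
    '€'.encode('cp852')
except UnicodeEncodeError as e:
    print(e)
```

So the crash happens on output, but the damage was already done on input: the bytes were decoded with the wrong codec.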
I get the idea that the problem comes from the fact that Python tries to read a "wrongly" encoded document, but is there a way to tell fileinput.input to read UTF-8?
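Both open() and fileinput accept an explicit encoding: open() takes an encoding= keyword, and fileinput.input() takes an openhook, with fileinput.hook_encoded() shipped in the standard library for exactly this. Using 'utf-8-sig' rather than plain 'utf-8' also strips the BOM that Notepad tends to put at the start of the file. A minimal sketch (demo.txt stands in for the real file, which is a placeholder path in the question):

```python
import fileinput

# Create a small UTF-8 file with a BOM, standing in for 'path/to/file.txt'.
with open('demo.txt', 'w', encoding='utf-8-sig') as f:
    f.write('“blabla this is some text”')

# Option 1: plain open() with an explicit encoding.
# 'utf-8-sig' decodes UTF-8 and silently drops a leading BOM if present.
with open('demo.txt', encoding='utf-8-sig') as fi:
    text = fi.read()
print(text)  # “blabla this is some text”

# Option 2: keep fileinput, and pass the codec through an openhook.
with fileinput.input('demo.txt',
                     openhook=fileinput.hook_encoded('utf-8-sig')) as fi:
    for line in fi:
        print(line)
```

Whether the decoded text then prints cleanly is a separate question, since that depends on the console's own codec.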
EDIT: Some really weird stuff is happening and I have NO idea how any of it works. After saving the same file as before in Notepad++, the Python code now runs within IDLE and produces the following output (newlines removed):
“blabla this is some text”
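The IDLE-vs-console difference is less mysterious than it looks: each Python process has a locale-dependent default codec for open() and a separate stdout codec that depends on where output goes. IDLE's shell handles Unicode directly, while cmd.exe uses an OEM codepage such as cp852 until you change it with chcp. You can inspect both defaults (the exact values printed depend entirely on your system, so none are guaranteed):

```python
import locale
import sys

# Default codec used by open() when no encoding= is passed; this is
# what silently decoded the file with the wrong codec on Windows.
print(locale.getpreferredencoding(False))

# Codec used to encode what print() sends to this terminal
# (cp852 in the traceback; something else under IDLE or after chcp 65001).
print(sys.stdout.encoding)
```

Running this in IDLE and in the command prompt should show two different pairs of values, which is why the same script behaves differently in each.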
while I can keep the command prompt from crashing if I first enter chcp 65001. Running the file then results in:
Ä»żâ€śblabla this is some text ”
Any ideas? This is a horrible mess, if you ask me, but it is vital I understand it...