I am becoming more and more convinced that the business of file encodings is made as confusing as possible on purpose. I have a problem reading a UTF-8 encoded file that contains just one line:
“blabla this is some text”
(note that the quotation marks are fancy versions of the standard quotation marks).
Now, I run this piece of Python code on it:
import fileinput

def charinput(paths):
    with open(paths) as fi:
        for line in fi:
            for char in line:
                yield char

i = charinput('path/to/file.txt')
for item in i:
    print(item)
with two results: if I run the Python code from the command prompt, the result is some strange characters, followed by an error message:
ď
»
ż
â
Traceback (most recent call last):
File "krneki.py", line 11, in <module>
print(item)
File "C:\Python34\lib\encodings\cp852.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position
0: character maps to <undefined>
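For what it's worth, those garbage characters line up with a plausible explanation: the file was saved as UTF-8 with a byte order mark (the bytes EF BB BF), open() decoded it with the Windows locale default codec (cp1250 here is an assumption, inferred from the characters shown), and print() then failed when the cp852 console codec could not render the resulting '€'. A sketch of that chain:

```python
# Sketch: what happens when UTF-8 bytes (with a BOM) are decoded as cp1250.
# The BOM U+FEFF encodes to the bytes EF BB BF in UTF-8.
raw = '\ufeff“blabla'.encode('utf-8')

# Decoding those bytes as Windows-1250 (a common locale default on
# Central European Windows) yields mojibake instead of the original text.
mangled = raw.decode('cp1250')
print(mangled)  # ď»żâ€śblabla

# The '€' (U+20AC) inside 'â€ś' is what the cp852 console codec cannot
# encode, which matches the UnicodeEncodeError in the traceback.
try:
    '€'.encode('cp852')
except UnicodeEncodeError as e:
    print(e)
```

So the crash happens on output, but the damage was already done on input: the bytes were decoded with the wrong codec.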
I get the idea that the problem comes from the fact that Python tries to read a "wrongly" encoded document, but is there a way to tell fileinput.input to read UTF-8?
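Both open() and fileinput accept an explicit encoding: open() takes an encoding= keyword, and fileinput.input() takes an openhook, with fileinput.hook_encoded() shipped in the standard library for exactly this. Using 'utf-8-sig' rather than plain 'utf-8' also strips the BOM that Notepad tends to put at the start of the file. A minimal sketch (demo.txt stands in for the real file, which is a placeholder path in the question):

```python
import fileinput

# Create a small UTF-8 file with a BOM, standing in for 'path/to/file.txt'.
with open('demo.txt', 'w', encoding='utf-8-sig') as f:
    f.write('“blabla this is some text”')

# Option 1: plain open() with an explicit encoding.
# 'utf-8-sig' decodes UTF-8 and silently drops a leading BOM if present.
with open('demo.txt', encoding='utf-8-sig') as fi:
    text = fi.read()
print(text)  # “blabla this is some text”

# Option 2: keep fileinput, and pass the codec through an openhook.
with fileinput.input('demo.txt',
                     openhook=fileinput.hook_encoded('utf-8-sig')) as fi:
    for line in fi:
        print(line)
```

Whether the decoded text then prints cleanly is a separate question, since that depends on the console's own codec.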
EDIT: Some really weird stuff is happening and I have NO idea how any of it works. After saving the same file as before in Notepad++, the Python code now runs within IDLE and produces the following output (newlines removed):
“blabla this is some text”
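The IDLE-vs-console difference is less mysterious than it looks: each Python process has a locale-dependent default codec for open() and a separate stdout codec that depends on where output goes. IDLE's shell handles Unicode directly, while cmd.exe uses an OEM codepage such as cp852 until you change it with chcp. You can inspect both defaults (the exact values printed depend entirely on your system, so none are guaranteed):

```python
import locale
import sys

# Default codec used by open() when no encoding= is passed; this is
# what silently decoded the file with the wrong codec on Windows.
print(locale.getpreferredencoding(False))

# Codec used to encode what print() sends to this terminal
# (cp852 in the traceback; something else under IDLE or after chcp 65001).
print(sys.stdout.encoding)
```

Running this in IDLE and in the command prompt should show two different pairs of values, which is why the same script behaves differently in each.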
while I can keep the command prompt from crashing if I first enter chcp 65001. Running the file then results in:
Ä»żâ€śblabla this is some text ”
Any ideas? This is a horrible mess, if you ask me, but it is vital I understand it...