Unexpected results from Path.read_text (of pathlib) when reading utf-8 encoded file

Question

Today I learned that with open(filename).read() we cannot rely on the resources bound to the hidden file object being released immediately, even though that is what I observed on my system (see the accepted answer to the question "Does reading an entire file leave the file handle open?").

The second answer there kept me from rolling my own helper function: it says that pathlib already offers exactly this function.

But actually, this seems not to be the case. With the following script (test.py), the two approaches give different results:

# The German accent characters are Ä,Ö,Ü,ä,ö,ü, and ß.
from pathlib import Path

def pathlib_read_text(filename, encoding=None):
    return Path(filename, encoding=encoding).read_text()

def mylocal_read_text(filename, encoding=None):
    with open(filename, encoding=encoding) as f:
        return f.read()

def test(fun):
    print(fun+'_read_text:')
    print(eval(fun+'_read_text')(__file__, 'utf-8'))

test('pathlib')
test('mylocal')

The output to the Windows console (python test.py) contains Ã„,Ã–,Ãœ,Ã¤,Ã¶,Ã¼, and ÃŸ in the first block. When I redirect the output into a file, the second block is the wrong one instead: if the file is treated as utf-8, Notepad++ displays xC4,xD6,xDC,xE4,xF6,xFC, and xDF in white on black.
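Both symptoms described above can be reproduced in isolation. The following is only a sketch under an assumption the question does not state, namely that the Windows default encoding involved is cp1252: UTF-8 bytes misread as cp1252 give the two-character sequences seen in the console, and correctly decoded text written out as cp1252 gives the single bytes that Notepad++ flags as invalid UTF-8.

```python
# Assumption: the Windows default encoding is cp1252 (not stated in the question).
accents = "ÄÖÜäöüß"

# Case 1: UTF-8 bytes decoded with the wrong codec.
# Every accented letter turns into two characters, the first being "Ã".
garbled = accents.encode("utf-8").decode("cp1252")

# Case 2: correctly decoded text written out as cp1252.
# Every letter becomes a single byte that is not valid UTF-8 on its own.
redirected = accents.encode("cp1252")

# ascii() keeps the output printable on any console encoding.
print(ascii(garbled))
print(redirected.hex(" "))
```

If the assumption holds, the second line printed is the byte sequence c4 d6 dc e4 f6 fc df, which matches the values shown by Notepad++.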

Is there anything I overlooked?

I tried to examine the 3.6.3 code, but found no bug so far ...

Edit

The following version reinforces my feeling that it's a bug in pathlib or in one of the underlying libraries/functions. Maybe it's only a Windows issue, where the default encoding usually differs from utf-8. Now it's sufficient to run the test in a console window.

accents = '''
Ä,Ö,Ü,ä,ö,ü,ß
'''
from pathlib import Path
import codecs

def pathlib_read_text(filename, encoding=None, errors=None):
    return Path(filename, encoding=encoding, errors=errors).read_text()

def mylocal_read_text(filename, encoding=None, errors=None):
    with open(filename, encoding=encoding, errors=errors) as f:
        return f.read()

def space_it(error):
    # An error handler must return (replacement, position to resume decoding at).
    return (' ', error.end)
codecs.register_error('space_it', space_it)

def test(fun):
    s = eval(fun+'_read_text')(__file__, 'utf-8', errors='space_it')
    print(fun+'_read_text:', s.split("\n")[1] == accents.strip())

test('pathlib')
test('mylocal')

It produces the following output:

pathlib_read_text: False
mylocal_read_text: True

I can't reproduce your problem with redirecting the output to a file (I get the same output as in the console), but I can confirm that `read_text` garbles the special characters. — Aran-Fey, Mar 22 '18 at 12:33
@Aran-Fey probably also Windows? I'm trying to [get it running on ideone.com](https://ideone.com/104z2P) with no luck so far ... — Wolf, Mar 22 '18 at 12:40
By the way, did you notice that `read_text` also accepts `encoding` as a parameter? That essentially did it for me. — Juho, Mar 05 '22 at 17:20
Thanks, @Juho, for letting me know. What about making an answer of this? — Wolf, Mar 06 '22 at 14:59
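Following Juho's comment, here is a minimal sketch of the helper with `encoding` and `errors` passed to `read_text` instead of to the `Path` constructor. The results in the question suggest that the constructor ignored these keyword arguments in the Python version used, so `read_text` fell back to the platform default encoding. The temporary file and the sample text are assumptions made only so that the check runs anywhere.

```python
import os
import tempfile
from pathlib import Path

def pathlib_read_text(filename, encoding=None, errors=None):
    # encoding and errors are parameters of read_text(), not of Path().
    return Path(filename).read_text(encoding=encoding, errors=errors)

def mylocal_read_text(filename, encoding=None, errors=None):
    with open(filename, encoding=encoding, errors=errors) as f:
        return f.read()

# Hypothetical check: write a UTF-8 file and read it back with both helpers.
sample = "Ä,Ö,Ü,ä,ö,ü,ß\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "accents.txt")
    Path(path).write_text(sample, encoding="utf-8")
    same = pathlib_read_text(path, "utf-8") == mylocal_read_text(path, "utf-8") == sample

print("both helpers agree:", same)
```

With the arguments in the right place, both helpers should return the same text regardless of the platform default encoding.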
