The result in this case has nothing to do with Windows or the standard I/O implementation of Microsoft's C runtime. You'll see the same result if you test this in Python 2 on a Linux system. It's just how file.readlines (2.7.12 source link) works in Python 2. See line 1717, p = (char *)memchr(buffer+nfilled, '\n', nread), and then line 1749, line = PyString_FromStringAndSize(q, p-q). It naively consumes up to a \n character, which is why the actual UTF-16LE \n\x00 sequence gets split up.
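The byte-level split can be sketched in a few lines of Python 3 (a simplified model of the memchr scan, not the actual C code; the bytes literal stands in for the file contents):

```python
# Bytes of a UTF-16LE file containing "hi1\r\nhi2", with a BOM.
data = b'\xff\xfe' + 'hi1\r\nhi2'.encode('utf-16-le')

# Model of the memchr scan: split at every 0x0A byte and keep it at the
# end of each line, with no regard for the encoding's code unit size.
chunks = data.split(b'\n')
lines = [c + b'\n' for c in chunks[:-1]] + [chunks[-1]]

# The low byte of the wide '\n\x00' ends the first line; the stray high
# byte '\x00' is left to start the second line.
print(lines)
```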
If you had opened the file using Python 2's universal newlines mode, e.g. open('d:/t/hi2.txt', 'U'), the \r\x00 sequences would naively be translated to \n\x00. The result of readlines would instead be ['\xff\xfeh\x00i\x001\x00\n', '\x00\n', '\x00h\x00i\x002\x00'].
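That translated result can be modeled the same way (again a sketch; the blanket \r-to-\n replacement is only valid here because this byte stream happens to contain no adjacent \r\n pair):

```python
data = b'\xff\xfe' + 'hi1\r\nhi2'.encode('utf-16-le')

# Universal newlines mode works on bytes, so the '\r' of the wide
# '\r\x00' looks like a lone carriage return and becomes '\n'.
translated = data.replace(b'\r', b'\n')

# Same naive split as before, now hitting two 0x0A bytes.
chunks = translated.split(b'\n')
lines = [c + b'\n' for c in chunks[:-1]] + [chunks[-1]]
print(lines)
```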
Thus your initial supposition is correct. You need to know the encoding, or at least know to look for a Unicode BOM (byte order mark) at the start of the file, such as \xff\xfe, which indicates UTF-16LE (little endian). To that end I recommend using the io module in Python 2.7, since it properly handles newline translation. codecs.open, on the other hand, requires binary mode on the wrapped file and ignores universal newlines mode:
>>> codecs.open('test.txt', 'U', encoding='utf-16').readlines()
[u'hi1\r\n', u'hi2']
io.open returns a TextIOWrapper that has built-in support for universal newlines:
>>> io.open('test.txt', encoding='utf-16').readlines()
[u'hi1\n', u'hi2']
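The BOM check mentioned above can be sketched with the codecs module's BOM constants (the helper name sniff_encoding is mine; the UTF-32 BOMs must be tested first because the UTF-32LE BOM begins with the UTF-16LE one):

```python
import codecs

def sniff_encoding(path, default='utf-8'):
    # Read the first four bytes and compare against the known BOMs.
    with open(path, 'rb') as f:
        head = f.read(4)
    boms = [(codecs.BOM_UTF32_LE, 'utf-32-le'),
            (codecs.BOM_UTF32_BE, 'utf-32-be'),
            (codecs.BOM_UTF8, 'utf-8-sig'),
            (codecs.BOM_UTF16_LE, 'utf-16'),   # 'utf-16' consumes the BOM
            (codecs.BOM_UTF16_BE, 'utf-16')]
    for bom, enc in boms:
        if head.startswith(bom):
            return enc
    return default
```

io.open(path, encoding=sniff_encoding(path)) then behaves like the io.open example above without hard-coding the encoding.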
Regarding Microsoft's CRT, it defaults to ANSI text mode. Microsoft's ANSI codepages are supersets of ASCII, so the CRT's newline translation will work for files encoded with an ASCII compatible encoding such as UTF-8. On the other hand, ANSI text mode doesn't work for a UTF-16 encoded file, i.e. it doesn't remove the UTF-16LE BOM (\xff\xfe) and doesn't translate newlines:
>>> open('test.txt').read()
'\xff\xfeh\x00i\x001\x00\r\x00\n\x00h\x00i\x002\x00'
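The reason is visible in a byte-level model of the translation (a rough sketch of ANSI text mode only; the real _read also handles a trailing Ctrl-Z and other cases):

```python
def crt_text_read(data):
    # Rough model of CRT ANSI text mode: only an adjacent b'\r\n' pair
    # collapses to b'\n'; every other byte passes through untouched.
    return data.replace(b'\r\n', b'\n')

# ASCII-compatible encoding: '\r' and '\n' are adjacent bytes, so the
# newline translation works.
print(crt_text_read('hi1\r\nhi2'.encode('utf-8')))

# UTF-16LE: the wide newline is b'\r\x00\n\x00', so the CRT never sees
# an adjacent b'\r\n' and nothing is translated (the BOM stays, too).
print(crt_text_read(b'\xff\xfe' + 'hi1\r\nhi2'.encode('utf-16-le')))
```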
Thus using standard I/O text mode for a UTF-16 encoded file requires the non-standard ccs flag, e.g. fopen("d:/t/hi2.txt", "rt, ccs=UNICODE"). Python doesn't support this Microsoft extension to the open mode, but it does make the CRT's low I/O (POSIX) _open and _read functions available in the os module. While it might surprise POSIX programmers, Microsoft's low I/O API also supports text mode, including Unicode. For example:
>>> O_WTEXT = 0x10000
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 100)
'h\x00i\x001\x00\n\x00h\x00i\x002\x00'
>>> os.close(fd)
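Outside of Windows you can model what wide text mode just did (assumption: this decode/re-encode round trip is my approximation of the observable result, not how the CRT implements it):

```python
raw = b'\xff\xfe' + 'hi1\r\nhi2'.encode('utf-16-le')

# The 'utf-16' codec consumes the BOM and picks the byte order, as
# O_WTEXT does; newline translation then happens on wide characters,
# not on raw bytes.
text = raw.decode('utf-16').replace('\r\n', '\n')
wide = text.encode('utf-16-le')
print(wide)  # matches the os.read result above
```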
The O_WTEXT constant isn't made directly available in Windows Python because it's not safe to open a file descriptor with this mode as a Python file using os.fdopen. The CRT expects all wide-character buffers to be a multiple of the size of a wchar_t, i.e. a multiple of 2. Otherwise it invokes the invalid parameter handler, which kills the process. For example (using the cdb debugger):
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 7)
ntdll!NtTerminateProcess+0x14:
00007ff8`d9cd5664 c3 ret
0:000> k8
Child-SP RetAddr Call Site
00000000`005ef338 00007ff8`d646e219 ntdll!NtTerminateProcess+0x14
00000000`005ef340 00000000`62db5200 KERNELBASE!TerminateProcess+0x29
00000000`005ef370 00000000`62db52d4 MSVCR90!_invoke_watson+0x11c
00000000`005ef960 00000000`62db0cff MSVCR90!_invalid_parameter+0x70
00000000`005ef9a0 00000000`62db0e29 MSVCR90!_read_nolock+0x76b
00000000`005efa40 00000000`1e056e8a MSVCR90!_read+0x10d
00000000`005efaa0 00000000`1e0c3d49 python27!Py_Main+0x12a8a
00000000`005efae0 00000000`1e1146d4 python27!PyCFunction_Call+0x69
The same applies to _O_U8TEXT and _O_U16TEXT.
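If you do open a descriptor in one of these wide text modes, one defensive option is to clamp every read to an even size before it reaches the CRT (read_wide is a hypothetical helper of mine, not part of the os module):

```python
import os

def read_wide(fd, n):
    # Round the request down to a multiple of sizeof(wchar_t) == 2, so
    # the CRT's _read never receives an odd-sized wide-character buffer,
    # which would trip the invalid parameter handler shown above.
    return os.read(fd, n & ~1)
```

With this wrapper, read_wide(fd, 7) issues a 6-byte read instead of terminating the process.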