
I thought this was a caveat of a Unicode world: you cannot correctly process a byte stream as text without knowing its encoding. If you assume an encoding, then you might get valid - but incorrect - characters showing up.
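
As a quick illustration of "valid but incorrect" (a Python 3 sketch, not part of the original question): the same bytes decode without error under more than one encoding, and only knowing the intended one tells you which result is right.

```python
# The same four bytes decode cleanly under two different encodings,
# but only one result is the text the author actually wrote.
data = 'hi'.encode('utf-16-le')        # b'h\x00i\x00'
as_utf16 = data.decode('utf-16-le')    # 'hi' -- correct
as_latin1 = data.decode('latin-1')     # 'h\x00i\x00' -- valid, but wrong
print(repr(as_utf16), repr(as_latin1))
```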

Here's a test - a file with the text:

hi1
hi2

stored on disk with a 2-byte Unicode encoding:

[Hex editor view of file]

Windows newline characters \r\n are stored as the four-byte sequence 0D 00 0A 00. Open it in Python 2 with the defaults - I think it expects ASCII, one byte per character (or just a stream of bytes) - and it reads:

>>> open('d:/t/hi2.txt').readlines()
['\xff\xfeh\x00i\x001\x00\r\x00\n', 
 '\x00h\x00i\x002\x00']

It's not decoding two bytes into one character, yet the four byte line ending sequence has been detected as two characters, and the file has been correctly split into two lines.

Presumably, then, Windows opened the file in 'text mode', as described here: Difference between files written in binary and text mode

and fed the lines to Python. But how did Windows know the file was multibyte encoded, and to look for four-bytes of newlines, without being told, as per the caveat at the top of the question?

  • Does Windows guess, with a heuristic - and therefore can be wrong?
  • Is there more cleverness in the design of Unicode, something which makes Windows newline patterns unambiguous across encodings?
  • Is my understanding wrong, and there is a correct way to process any text file without being told the encoding beforehand?
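
For reference, a naive split of the file's raw bytes at each 0x0A byte reproduces exactly the readlines result above (a Python 3 sketch that rebuilds the bytes in memory rather than reading d:/t/hi2.txt):

```python
# UTF-16LE bytes of "hi1\r\nhi2", preceded by the BOM, as stored on disk
data = b'\xff\xfe' + 'hi1\r\nhi2'.encode('utf-16-le')

# Split at every 0x0A byte, keeping it at the end of its line --
# roughly what Python 2's file.readlines does
chunks = data.split(b'\n')
lines = [c + b'\n' for c in chunks[:-1]] + [chunks[-1]]
print(lines)
# [b'\xff\xfeh\x00i\x001\x00\r\x00\n', b'\x00h\x00i\x002\x00']
```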
TessellatingHeckler
  • Windows has no concept of "text mode". You're talking about the C runtime, but that defaults to ANSI text mode, which doesn't detect a UTF-16 `\r\x00\n\x00` sequence. What you're actually seeing is the implementation of the [`file.readlines`](https://hg.python.org/cpython/file/v2.7.12/Objects/fileobject.c#l1659) method in Python 2. See line 1717, `p = (char *)memchr(buffer+nfilled, '\n', nread)` and then line 1749, `line = PyString_FromStringAndSize(q, p-q)`. It naively consumes up to a `\n` character, which is why the actual UTF-16LE `\n\x00` gets split. – Eryk Sun Jul 15 '16 at 18:46
  • *Windows has no concept of "text mode"* - [fopen() at MSDN](https://msdn.microsoft.com/en-us/library/yeby3zcb%28vs.71%29.aspx) - "*in text mode, carriage return–linefeed combinations are translated into single linefeeds on input, and linefeed characters are translated to carriage return–linefeed combinations on output. When a Unicode stream-I/O function operates in text mode (the default), the source or destination stream is assumed to be a sequence of multibyte characters.*". But it makes sense that the unicode newline quad is being shattered by Python, that's probably my answer. – TessellatingHeckler Jul 15 '16 at 19:49
  • Microsoft's `_open`, `fopen`, `_read`, `fread`, etc are all C runtime functions. They're not inherently part of Windows file I/O, i.e. `CreateFile`, `ReadFile`, etc. Using Microsoft's CRT is optional. You could use a different CRT that has no "text mode", or you could just use the Windows API directly. It happens that Windows Python is built using MSVC and uses the CRT's low I/O and standard I/O runtime functions to facilitate cross-platform compatibility with POSIX operating systems such as Linux and OS X. – Eryk Sun Jul 15 '16 at 20:02
  • @eryksun then if you'll put your link to the Python source in a full answer instead of a comment, I'll mark it as accepted. Thank you. – TessellatingHeckler Jul 15 '16 at 20:40

2 Answers


The result in this case has nothing to do with Windows or the standard I/O implementation of Microsoft's C runtime. You'll see the same result if you test this in Python 2 on a Linux system. It's just how file.readlines (2.7.12 source link) works in Python 2. See line 1717, p = (char *)memchr(buffer+nfilled, '\n', nread) and then line 1749, line = PyString_FromStringAndSize(q, p-q). It naively consumes up to a \n character, which is why the actual UTF-16LE \n\x00 sequence gets split up.

If you had opened the file using Python 2's universal newlines mode, e.g. open('d:/t/hi2.txt', 'U'), the \r\x00 sequences would naively be translated to \n\x00. The result of readlines would instead be ['\xff\xfeh\x00i\x001\x00\n', '\x00\n', '\x00h\x00i\x002\x00'].
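
That translation can be simulated in Python 3 (a sketch of the behaviour, not the actual CPython code path): rewrite each \r byte to \n before the naive split at \n.

```python
data = b'\xff\xfe' + 'hi1\r\nhi2'.encode('utf-16-le')

# Universal-newline translation operates on bytes here, so the \r byte --
# separated from its \n by a NUL in UTF-16LE -- is rewritten on its own
translated = data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

chunks = translated.split(b'\n')
lines = [c + b'\n' for c in chunks[:-1]] + [chunks[-1]]
print(lines)
# [b'\xff\xfeh\x00i\x001\x00\n', b'\x00\n', b'\x00h\x00i\x002\x00']
```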

Thus your initial supposition is correct. You need to know the encoding, or at least know to look for a Unicode BOM (byte order mark) at the start of the file, such as \xff\xfe, which indicates UTF-16LE (little endian). To that end I recommend using the io module in Python 2.7, since it properly handles newline translation. codecs.open, on the other hand, requires binary mode on the wrapped file and ignores universal newline mode:

>>> codecs.open('test.txt', 'U', encoding='utf-16').readlines()
[u'hi1\r\n', u'hi2']

io.open returns a TextIOWrapper that has built-in support for universal newlines:

>>> io.open('test.txt', encoding='utf-16').readlines()
[u'hi1\n', u'hi2']

Regarding Microsoft's CRT, it defaults to ANSI text mode. Microsoft's ANSI codepages are supersets of ASCII, so the CRT's newline translation will work for files encoded with an ASCII compatible encoding such as UTF-8. On the other hand, ANSI text mode doesn't work for a UTF-16 encoded file, i.e. it doesn't remove the UTF-16LE BOM (\xff\xfe) and doesn't translate newlines:

>>> open('test.txt').read()
'\xff\xfeh\x00i\x001\x00\r\x00\n\x00h\x00i\x002\x00' 

Thus using standard I/O text mode for a UTF-16 encoded file requires the non-standard ccs flag, e.g. fopen("d:/t/hi2.txt", "rt, ccs=UNICODE"). Python doesn't support this Microsoft extension to the open mode, but it does make the CRT's low I/O (POSIX) _open and _read functions available in the os module. While it might surprise POSIX programmers, Microsoft's low I/O API also supports text mode, including Unicode. For example:

>>> O_WTEXT = 0x10000
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 100)
'h\x00i\x001\x00\n\x00h\x00i\x002\x00'
>>> os.close(fd)

The O_WTEXT constant isn't made directly available in Windows Python because it's not safe to open a file descriptor with this mode as a Python file using os.fdopen. The CRT expects all wide-character buffers to be a multiple of the size of a wchar_t, i.e. a multiple of 2. Otherwise it invokes the invalid parameter handler that kills the process. For example (using the cdb debugger):

>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 7)
ntdll!NtTerminateProcess+0x14:
00007ff8`d9cd5664 c3              ret
0:000> k8
Child-SP          RetAddr           Call Site
00000000`005ef338 00007ff8`d646e219 ntdll!NtTerminateProcess+0x14
00000000`005ef340 00000000`62db5200 KERNELBASE!TerminateProcess+0x29
00000000`005ef370 00000000`62db52d4 MSVCR90!_invoke_watson+0x11c
00000000`005ef960 00000000`62db0cff MSVCR90!_invalid_parameter+0x70
00000000`005ef9a0 00000000`62db0e29 MSVCR90!_read_nolock+0x76b
00000000`005efa40 00000000`1e056e8a MSVCR90!_read+0x10d
00000000`005efaa0 00000000`1e0c3d49 python27!Py_Main+0x12a8a
00000000`005efae0 00000000`1e1146d4 python27!PyCFunction_Call+0x69

The same applies to _O_UTF8 and _O_UTF16.

Eryk Sun
  • That's comprehensive, and conclusive. I should have noticed the `\n\x00` being split in half and suspected something there. (Now I'm wondering, if Windows has no concept of a text mode, how is it that Windows has a fixed text file line ending sequence - is it just a convention to use `\r\n`? Or is your statement "*Microsoft's low I/O API also supports text mode*" a change since your comment?) – TessellatingHeckler Jul 16 '16 at 04:07
  • Microsoft's CRLF convention was inherited from CP/M, which inherited it from TOPS-10. Microsoft's language libraries, whether it's native C/C++ or .NET based, typically [hard code CRLF](https://msdn.microsoft.com/en-us/library/system.environment.newline) as the line ending, but a .NET `TextWriter` does allow setting [`NewLine`](https://msdn.microsoft.com/en-us/library/system.io.textwriter.newline). However, this Windows convention isn't hard coded in the OS. The I/O API of Windows itself is binary. It doesn't have a notion of text 'lines'. Many programs use the Unix LF convention on Windows. – Eryk Sun Jul 16 '16 at 04:54

First things first: open your file as text, indicating the correct encoding, and in explicit text mode.

If you are still using Python 2.7, use codecs.open instead of open. In Python 3.x, just use open:

import codecs
myfile = codecs.open('d:/t/hi2.txt', 'rt', encoding='utf-16')

And you should be able to work on it.
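
A minimal Python 3 version (a sketch; it re-creates the sample file locally - substitute your real path). encoding='utf-16' consumes the BOM and detects the byte order:

```python
# Re-create the sample file, then read it back as text;
# universal newlines then yields clean '\n'-terminated lines
with open('hi2.txt', 'wb') as f:
    f.write(b'\xff\xfe' + 'hi1\r\nhi2'.encode('utf-16-le'))

with open('hi2.txt', encoding='utf-16') as myfile:
    lines = myfile.readlines()
print(lines)   # ['hi1\n', 'hi2']
```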

Second, here is what is likely going on: since you did not specify that you were opening the file in binary mode, Windows opens it in "text" mode - Windows does know about the encoding, and thus can find the \r\n sequences in the lines - it reads the lines separately, performing the end-of-line translation using utf-16, and passes those utf-16 bytes to Python.

On the Python side, you could use these values, just decoding them to text:

[line.decode("utf-16") for line in open('d:/t/hi2.txt')]

instead of

 open('d:/t/hi2.txt').readlines()
jsbueno
  • That's good advice, but it's not relevant to my question, I'm not asking how to handle Unicode files in Python 2, I'm asking how the file data is correctly split into two lines even with an encoding which makes that "impossible". - `Windows does know about the encoding` - how? How can it possibly know? If it knows, why wouldn't Python know? – TessellatingHeckler Jul 15 '16 at 18:27
  • I've addressed that on the paragraph that starts "Second..." – jsbueno Jul 15 '16 at 22:27