I have the following file:

abcde
kwakwa
<0x1A>
line3
linllll

Where <0x1A> represents a byte with the hex value of 0x1A. When attempting to read this file in Python as:

for line in open('t.txt'):
    print line,

It only reads the first two lines, and exits the loop.

The solution seems to be to open the file in binary or universal newline mode ('rb' or 'rU'). Can you explain this behavior?

tzot
Eli Bendersky
  • How do you know that the byte is <0x1A>? For me it just shows 'SUB' in Notepad++. – Programmer Aug 23 '12 at 06:13
  • Another work-around is to use Python 3 or [`io.open()`](https://docs.python.org/2/library/io.html#io.open) in Python 2; the `io` file objects always use the file in binary mode as far as the OS is concerned and so Windows won't 'end' the file prematurely. – Martijn Pieters Apr 19 '16 at 21:24

2 Answers

0x1A is Ctrl-Z, and DOS historically used that as an end-of-file marker. For example, try using a command prompt, and "type"ing your file. It will only display the content up to the Ctrl-Z.

Python uses the Windows CRT function _wfopen, which implements the "Ctrl-Z is EOF" semantics.
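As a minimal sketch of the binary-mode workaround mentioned in the question, the snippet below recreates the sample file in a temporary directory and reads it with 'rb', so the C runtime never applies its text-mode Ctrl-Z handling. The helper name `read_lines_binary` and the temporary path are hypothetical, chosen only for illustration, and the syntax is written to run on Python 3 as well as recent Python 2.

```python
import os
import tempfile

SUB = b"\x1a"  # Ctrl-Z, the legacy DOS end-of-file marker


def read_lines_binary(path):
    """Return every line of the file as bytes, including lines after 0x1A.

    Hypothetical helper: binary mode bypasses text-mode EOF handling.
    """
    with open(path, "rb") as f:
        return f.read().splitlines()


# Recreate the file from the question in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "t.txt")
with open(path, "wb") as f:
    f.write(b"abcde\nkwakwa\n" + SUB + b"\nline3\nlinllll\n")

lines = read_lines_binary(path)
print(len(lines))
```

Because the data comes back as bytes, any decoding to text has to be done explicitly afterwards.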

Ned Batchelder
  • Reminds me that I once had to build a PostScript document with LaTeX that included PostScript images created on Windows. I wondered why the printer stopped printing after the first picture ... Well, the last byte in the PostScript picture files was 0x1A. –  Jan 01 '09 at 16:20

Ned is of course correct.

If your curiosity runs a little deeper, the root cause is backwards compatibility taken to an extreme. Windows is compatible with DOS, which used Ctrl-Z as an optional end-of-file marker for text files. What you might not know is that DOS was compatible with CP/M, which was popular on small computers before the PC. CP/M's file system didn't keep track of file sizes down to the byte level; it only kept track of the number of floppy disk sectors. If your file wasn't an exact multiple of 128 bytes, you needed a way to mark the end of the text. This Wikipedia article implies that the selection of Ctrl-Z was based on an even older convention used by DEC.
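To make the sector-padding idea concrete, here is a rough sketch of how a text could be padded to a whole number of 128-byte records with Ctrl-Z and then recovered by cutting at the first marker. This illustrates the convention only and is not CP/M's actual code; `pad_to_records` and `strip_at_eof_marker` are hypothetical names.

```python
RECORD_SIZE = 128  # record size in bytes, as described above
SUB = b"\x1a"      # Ctrl-Z


def pad_to_records(text):
    """Pad bytes with Ctrl-Z up to the next multiple of RECORD_SIZE."""
    remainder = len(text) % RECORD_SIZE
    if remainder:
        text += SUB * (RECORD_SIZE - remainder)
    return text


def strip_at_eof_marker(data):
    """Return the bytes before the first Ctrl-Z, or all of them if none."""
    end = data.find(SUB)
    return data if end == -1 else data[:end]


original = b"abcde\nkwakwa\n"
stored = pad_to_records(original)      # what would occupy whole records
recovered = strip_at_eof_marker(stored)

# The same rule applied to the question's file cuts it after two lines.
truncated = strip_at_eof_marker(b"abcde\nkwakwa\n\x1a\nline3\nlinllll\n")
```

A reader that follows this rule cannot tell padding from a literal 0x1A byte inside the data, which is why the question's file appears to end after the second line.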

Community
Mark Ransom