Opening a file in text mode can lead to data loss in Python: why?

Question

The documentation for codecs.open() mentions that

"Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values."

How can using text mode for a file lead to "data loss"? It sounds like opening a file in text mode might truncate bytes to 7 bits, but I can't find any mention of this in the documentation: text mode is described only as a way of converting newlines, with no mention of any potential data loss. So, what does the documentation for codecs.open() refer to?

PS: While it is understandable that automatically converting newlines to the platform-dependent newline encoding requires some care, the question is about what is specific to 8-bit encodings. I would have guessed that only some encodings are compatible with the automatic newline conversion, irrespective of whether they are 8- or 7-bit encodings. So, why are 8-bit encodings singled out in codecs.open()'s documentation?

I would guess that converting newlines could have an effect on the data in the file. The data would mean the same, but, for example, \r\n might get converted to just \n. I'm really not sure but am equally curious now. +1 in the hope that somebody has an answer. – Endophage, May 17 '11 at 20:32
@Endophage: Yeah, text mode definitely performs the proper conversion of `\n` upon writing (this is well documented). The reverse operation is given by the `U` mode (universal newline). – Eric O. Lebigot, May 18 '11 at 12:42

score 5 · Answer 1 · answered May 17 '11 at 20:35

I think what they mean is that some encodings use all 8 bits in at least some bytes, so that all 256 byte values are possible (and in particular, it's possible to get a 0x0A or 0x0D byte that doesn't mean LF or CR).

In contrast, in a UTF-8 file, the CR and LF characters (and all other characters below 0x80) are always encoded as themselves. Those byte values cannot appear as part of the encoding of some other character.
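
This can be checked with a short sketch. It is only an illustration under some assumptions: UTF-16 (little-endian) is picked as an example of an encoding in which a 0x0A byte can belong to a character other than LF, and the helper `simulate_windows_text_mode_write` is a hypothetical stand-in for the `\n` to `\r\n` translation that text mode performs on Windows, not anything provided by `codecs`. The syntax is Python 3.

```python
# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) is not a newline,
# but its UTF-16-LE encoding contains the byte 0x0A.
text = "\u010a"

utf16 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

print(utf16)  # b'\n\x01'    -> contains the byte 0x0A
print(utf8)   # b'\xc4\x8a'  -> no byte below 0x80


def simulate_windows_text_mode_write(data: bytes) -> bytes:
    """Hypothetical helper: mimic text mode on Windows by turning
    every 0x0A byte into 0x0D 0x0A, regardless of what it encodes."""
    return data.replace(b"\n", b"\r\n")


# Translating newlines AFTER encoding corrupts the UTF-16 data ...
damaged = simulate_windows_text_mode_write(utf16)
print(damaged)  # b'\r\n\x01'

try:
    damaged.decode("utf-16-le")
except UnicodeDecodeError as exc:
    print("decode failed:", exc)

# ... while the UTF-8 bytes for the same character pass through untouched.
print(simulate_windows_text_mode_write(utf8) == utf8)  # True
```

If the newline translation is applied to already-encoded bytes, the stray 0x0D changes the byte stream, which would be one way to read "data loss" in the documentation.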

Igor Nazarenko

Interesting. Now, if I understand correctly, you are saying that `codecs.open()` creates a file object that first encodes, and then performs the newline conversions that could "damage" the encoded bytes, right? I would expect `codecs` to do the opposite: first perform the newline conversion, then encode, which would allow `codecs.open` to write in text mode… Is this correct? If yes, why didn't they do that? – Eric O. Lebigot May 18 '11 at 12:59
Even though these remarks about UTF-8 are interesting, I don't see how they explain the problem with "encodings using 8-bit values": ISO-8859-1 (aka Latin 1) also encodes to 8-bit values, and the conversion of newlines works fine with Latin 1, like it does with UTF-8. So, it is not clear what the problem with "8-bit encodings" is… – Eric O. Lebigot Jun 01 '11 at 19:49
… I would add that one can imagine 7-bit encodings that use 0x0A and 0x0D as part of the encoding of some characters (which might also trip up `codecs`?). So, again, I don't see what is specific to 8-bit encodings and their relationship with `codecs.open()` forcing binary mode. Any additional discussion would be welcome! – Eric O. Lebigot Jul 30 '11 at 07:23
