72

What makes parsing a text file in 'r' mode more convenient than parsing it in 'rb' mode, especially when the text file in question may contain non-ASCII characters?

MxLDevs

4 Answers

82

This depends a little bit on what version of Python you're using. In Python 2, Chris Drappier's answer applies.

In Python 3, it's a different (and more consistent) story: in text mode ('r'), Python decodes the file using the text encoding you give it (or, if you don't give one, a platform-dependent default), and read() gives you a str. In binary mode ('rb'), Python does not assume that the file contains anything that can reasonably be parsed as characters, and read() gives you a bytes object.

Also, in Python 3, universal newlines (the translation between '\n' and platform-specific newline conventions, so you don't have to care about them) are available for text-mode files on any platform, not just Windows.
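To make the str/bytes distinction concrete, here is a minimal sketch, assuming Python 3. The file name `sample.txt` and the sample text are made up for illustration; the file is written as raw UTF-8 bytes and then read back in both modes.

```python
import os
import tempfile

# Hypothetical sample: non-ASCII text with Windows-style line endings.
text = "café\r\nnaïve\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Write the exact bytes, bypassing any text-mode translation.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Binary mode: read() returns the bytes exactly as stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: read() decodes to str and normalizes line endings to '\n'.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

print(type(raw), raw)
print(type(decoded), repr(decoded))
```

Passing `encoding` explicitly sidesteps the platform-dependent default. In binary mode you would have to call `raw.decode("utf-8")` and deal with the line endings yourself.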

Community
lvc
  • For py3, will reading in text mode automatically try to detect what type of encoding it is? I imagine having to detect encoding is quite a challenge with a bytes object. – MxLDevs Mar 10 '12 at 06:47
  • @Keikoku Detecting encoding based on a stream alone, without any metadata, is impossible - think about the various encodings that are ASCII + use the 8th bit for information rather than parity; they all share 255 valid one-byte sequences, but only half of them (the ASCII half) represent the same character in each. Python's default isn't to guess it; it's a session-wide default encoding, spelled `sys.getdefaultencoding()`. On my Py3 install, it's UTF-8, but you can't rely on that always being the case. – lvc Mar 10 '12 at 07:26
  • @lvc As far as I can tell, the default encoding used by `open` is given by `locale.getpreferredencoding()`, not `sys.getdefaultencoding()`. On my system (Windows with Python 3.10), the former is 'cp1252', while the latter is 'utf-8'. – kadee Apr 06 '22 at 16:47
  • When I started reading this answer, I thought that the starting expression 'a little **bit**' was a joke :P Thank you for the explanation! – Sherlock Bourne May 31 '22 at 11:11
22

From the documentation:

On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.

Chris Drappier
  • So basically trying to read lines in binary mode is much more difficult because I'm not guaranteed that the EOL character is \n or \r\n or something else? – MxLDevs Mar 10 '12 at 05:47
13

The difference lies in how the end-of-line (EOL) is handled. Different operating systems use different characters to mark EOL: \n on Unix, \r on Mac versions prior to OS X, \r\n on Windows. When a file is opened in text mode, Python replaces the OS-specific end-of-line sequence read from the file with just \n. And vice versa: when you write \n to a file opened in text mode, Python writes the OS-specific EOL sequence. You can find your OS's default EOL by checking os.linesep.

When a file is opened in binary mode, no mapping takes place. What you read is what you get. Remember, text mode is the default mode. So if you are handling non-text files (images, video, etc.), make sure you open the file in binary mode, otherwise you’ll end up messing up the file by introducing (or removing) some bytes.

Python also has a universal newline mode. When a file is opened in this mode, Python maps all of the characters \r, \n and \r\n to \n.
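The risk of mangling non-text data can be sketched as follows, assuming Python 3. The payload is the 8-byte PNG signature, which happens to contain both \r\n and \n; the file name `blob.bin` is made up, and latin-1 is chosen only because it maps every byte to a character, so any change comes from newline translation alone.

```python
import os
import tempfile

# The 8-byte PNG signature: contains a \r\n pair and a lone \n.
payload = b"\x89PNG\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blob.bin")
    with open(path, "wb") as f:
        f.write(payload)

    # Binary mode: no mapping, what you read is what is on disk.
    with open(path, "rb") as f:
        as_binary = f.read()

    # Text mode: \r\n is collapsed to \n while reading.
    with open(path, "r", encoding="latin-1") as f:
        as_text = f.read()

# Encoding back to bytes shows that one byte (the \r) has been lost.
round_tripped = as_text.encode("latin-1")
print(as_binary == payload)
print(len(payload), len(round_tripped))
```

With most other encodings the text-mode read would likely fail earlier with a UnicodeDecodeError, which is one more reason to use 'rb' for images, video, and similar files.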

Asotos
shining
2

For clarification and to answer Agostino's comment/question (I don't have sufficient reputation to comment so bear with me stating this as an answer...):

In Python 2 on non-Windows platforms, no line-ending modification happens, in either text or binary mode. As has been stated before, in Python 2 Chris Drappier's answer applies (please note that its link nowadays points to the Python 3.x docs, but Chris' quoted text is of course from the Python 2 input and output tutorial).

So no, it is not true that opening a file in text mode with Python 2 on non-Windows does any line end modification:

0 $ cat data.txt 
line1
line2
line3
0 $ file data.txt 
data.txt: ASCII text, with CRLF line terminators
0 $ python2.7 -c 'f = open("data.txt"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "r"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "rb"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']

It is, however, possible to open the file in universal newline mode in Python 2, which performs exactly this line-ending normalization:

0 $ python2.7 -c 'f = open("data.txt", "rU"); print f.readlines()'
['line1\n', 'line2\n', 'line3\n']

(the 'U' universal newline mode specifier is deprecated in Python 3.x and was removed in Python 3.11)

On Python 3, on the other hand, line endings do get normalized to '\n' when reading a file in text mode, and '\n' gets converted to the current platform's default line ending when writing in text mode (in addition to the bytes<->unicode<->bytes decoding/encoding going on in text mode). For example, reading a DOS/Windows CRLF-terminated file on Linux will normalize the line endings to '\n'.
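A rough Python 3 counterpart to the data.txt session above, as a sketch: the file is recreated in a temporary directory (the paths and the `out.txt` name are made up), read with the default newline handling, and then read again with `newline=""`, which turns the translation off.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")

    # Recreate a CRLF-terminated file like data.txt.
    with open(path, "wb") as f:
        f.write(b"line1\r\nline2\r\nline3\r\n")

    # Default text mode (newline=None): line endings become '\n'.
    with open(path, "r", encoding="ascii") as f:
        normalized = f.readlines()

    # newline="" still splits on any line ending but returns it untranslated.
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated = f.readlines()

    # When writing, newline="\n" keeps '\n' as-is instead of using os.linesep.
    out = os.path.join(tmp, "out.txt")
    with open(out, "w", encoding="ascii", newline="\n") as f:
        f.write("line1\n")
    with open(out, "rb") as f:
        written = f.read()

print(normalized)
print(untranslated)
print(written)
```

The `newline=""` form is handy when the original line endings must be preserved, for example when copying or re-saving a file without changing it.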

  • Python 3's open function has a newline parameter to control that if required: https://docs.python.org/3/library/functions.html#open "newline controls how universal newlines mode works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows: When reading input from the stream, if newline is None, universal newlines mode is enabled" – Davos Sep 23 '17 at 13:27