
I have the following code. Because I opened the file in binary mode, what will be read into the variable "line"? Does each iteration:

  1. Read until the occurrence of a newline character, return, and repeat
  2. Read until some internal buffer size is reached OR a newline character occurs, return, and repeat

with open('filename', mode='rb') as f:
    for line in f:
        do_some_process(line)

The answers suggested earlier do not address this question. They discuss the differences between the modes, but my question is: suppose I'm reading a text file in binary mode and I want to ensure that I read line by line, i.e. read until the occurrence of a newline character. The 'b' mode seems to do this, but is it guaranteed to always happen? Does 'b' mode read data until the occurrence of a newline or until the buffer size is reached? I'm trying to understand how Python handles this under the hood.

kvb
  • You might want to read this: http://stackoverflow.com/questions/9644110/difference-between-parsing-a-text-file-in-r-and-rb-mode – sangheestyle Mar 03 '17 at 04:37
  • I have a slightly different question than the thread you suggested. Because the file is in binary mode, there is no concept of an end-of-line character. However, this code still seems to read line by line. How does it do that? Does that mean it is interpreting the characters and reading until EOL? – kvb Mar 03 '17 at 08:26

1 Answer


You can use both 'rt' and 'rb' on a text file. The results won't differ much when the content is plain English (ASCII) text. Look at this:

>>> f = open('test.txt','rb')
>>> f
<_io.BufferedReader name='test.txt'>
>>> list(f)
[b'FzListe\n', b'7MA1, 7OS1\n', b'7MA1, 7ZJB\n', b'\n', b'\n', b'7MA2, 7MA3, 7OS1\n', b'76G1, 7MA1, 7OS1\n', b'7MA1, 7OS1\n', b'71E5, 71E6, 7MA1, FSS1\n']
>>> 
>>> f = open('test.txt','rt')
>>> list(f)
['FzListe\n', '7MA1, 7OS1\n', '7MA1, 7ZJB\n', '\n', '\n', '7MA2, 7MA3, 7OS1\n', '76G1, 7MA1, 7OS1\n', '7MA1, 7OS1\n', '71E5, 71E6, 7MA1, FSS1\n']

If the file contains text in other languages, the difference shows up. Note that in binary mode no decoding (such as UTF-8) is applied to the characters:

>>> f = open('test.txt','rt')
>>> 
>>> list(f)
['علی\n', 'FzListe\n', '7MA1, 7OS1\n', '7MA1, 7ZJB\n', '\n', '\n', '7MA2, 7MA3, 7OS1\n', '76G1, 7MA1, 7OS1\n', '7MA1, 7OS1\n', '71E5, 71E6, 7MA1, FSS1\n']
>>> 
>>> f = open('test.txt','rb')
>>> list(f)
[b'\xd8\xb9\xd9\x84\xdb\x8c\n', b'FzListe\n', b'7MA1, 7OS1\n', b'7MA1, 7ZJB\n', b'\n', b'\n', b'7MA2, 7MA3, 7OS1\n', b'76G1, 7MA1, 7OS1\n', b'7MA1, 7OS1\n', b'71E5, 71E6, 7MA1, FSS1\n']

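So the answer is yes: depending on the file, 'rt' and 'rb' can give almost the same or different results on the same file. In both modes the content is split at '\n'; only the element type (str versus bytes) and the decoding differ.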
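To address the buffer-size part of the question more directly, here is a minimal sketch that iterates the same file in 'rb' mode with the default buffer and with a deliberately tiny one. The helper name read_binary_lines, the sample content, and the 16-byte buffer are assumptions chosen for illustration. As far as the io documentation goes, the line terminator for binary files is always b'\n', so the lines should come out identical regardless of buffer size.

```python
import os
import tempfile


def read_binary_lines(path, buffering=-1):
    """Return the list produced by iterating a file opened in 'rb' mode."""
    with open(path, mode='rb', buffering=buffering) as f:
        return list(f)


# Hypothetical sample data: the first line is much longer than the tiny
# buffer used below, and the last line has no trailing newline.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as out:
    out.write(b'a' * 100 + b'\n' + b'second\r\n' + b'no trailing newline')

default_lines = read_binary_lines(sample_path)
tiny_buffer_lines = read_binary_lines(sample_path, buffering=16)
os.remove(sample_path)

# Each item ends at b'\n' (or at end of file), never at a buffer boundary.
print(default_lines == tiny_buffer_lines)
print([len(line) for line in default_lines])
```

Note that b'\r\n' is left untouched in binary mode: the line still ends at the b'\n' byte, and the b'\r' stays in the data because no newline translation is done.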

But you can't use 'rt' on a binary file such as an image, because its bytes cannot be decoded with the text codec and an error is raised:

>>> f = open('test.jpg','rt')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9f in position 0: invalid start byte
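The contrast can be reproduced without an image file. This is only a sketch: io.BytesIO stands in for a file opened in 'rb' mode, and the byte values are made up for illustration (they begin with 0x9f, the byte from the traceback above). Iterating the raw bytes works and splits at b'\n', while decoding the same bytes as UTF-8 fails.

```python
import io

# Made-up "binary" content: not valid UTF-8, but it contains two b'\n' bytes.
data = bytes([0x9F, 0xD8, 0xFF, 0x0A, 0x00, 0x0A])

# Binary iteration never decodes, so it simply splits at each b'\n'.
lines = list(io.BytesIO(data))

# Text mode has to decode first; 0x9f cannot start a UTF-8 sequence.
try:
    data.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(lines)
print(decoded_ok)
```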
ali
  • Ali, I understand your explanation, but I have a slightly different question. Because the file is in binary mode, there is no concept of an end-of-line character. However, this code still seems to read line by line. Can this be assumed to always be the case? – kvb Mar 03 '17 at 08:25
  • If you check the object returned by open() in 'rb' mode, you can see that it has a .readline() method, so even in 'rb' mode it can still recognize the newline character, at least when we use .readline(). I think that list(file_opened_in_rb) actually returns every result of .readline(), which is why the output in the list is divided at newline characters; otherwise it would return the whole file as a single string. – ali Mar 03 '17 at 08:46
  • Not sure I follow. The typos are making it hard to follow. Could you help? – kvb Mar 04 '17 at 00:23