How to read the file contents from a file?

Question

Using Python 3, I hope to os.walk a directory of files, read each one into a binary object (a string?), and do some further processing on them. First step, though: how do I read the files that os.walk returns?

# NOTE: Execute with python3.2.2

import os
import sys

path = "/home/user/my-files"

count = 0
successcount = 0
errorcount = 0
i = 0

#for directory in dirs
for (root, dirs, files) in os.walk(path):
    # print (path)
    print(dirs)
    #print (files)

    for file in files:

        base, ext = os.path.splitext(file)
        fullpath = os.path.join(root, file)

        # Read the file into binary? --------
        input = open(fullpath, "r")
        content = input.read()
        length = len(content)
        count += 1
        print("    file: ---->", base, " / ", ext, " [count:", count, "]", "[length:", length, "]")
        print("fullpath: ---->", fullpath)

ERROR:

Traceback (most recent call last):
  File "myFileReader.py", line 41, in <module>
    content = input.read()
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 11: invalid continuation byte

Lennart Regebro · Accepted Answer · 2011-12-29T05:39:26.187

9

To read a binary file you must open the file in binary mode. Change

input = open(fullpath, "r")

to

input = open(fullpath, "rb")

The result of read() will then be a bytes object rather than a str.
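Applied to the loop in the question, this might look like the sketch below. The `read_all_bytes` helper name is made up for illustration, and the `with` block is an addition that closes each file once it has been read.

```python
import os


def read_all_bytes(path):
    """Yield (fullpath, content) for every file under path (hypothetical helper)."""
    for root, dirs, files in os.walk(path):
        for name in files:
            fullpath = os.path.join(root, name)
            # "rb" = read, binary: nothing is decoded, so no UnicodeDecodeError
            with open(fullpath, "rb") as f:
                content = f.read()  # a bytes object
            yield fullpath, content
```

Iterating over `read_all_bytes("/home/user/my-files")` and calling `len(content)` on each item gives every file's length in bytes.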

edited Dec 29 '11 at 05:39

answered Dec 29 '11 at 04:29


Tks, Lennart - Yes, this was the secret sauce I needed. Kinda new to Python3! – DrLou Dec 29 '11 at 17:03
It's not actually that Python 3 specific. Binary files should be opened with the 'b' flag in Python 2 as well. – Lennart Regebro Dec 29 '11 at 20:20
Yeah, it all seems kinda dumb to me in retrospect - but this is how we idiots learn! You're probably thinking: RTFM! Thanks again for help. – DrLou Nov 03 '14 at 21:17

score 3 · Answer 2 · edited May 23 '17 at 11:46

As some of your files are binary, they cannot be successfully decoded into the Unicode characters that Python 3 uses for all text strings. A large change between Python 2 and Python 3 is that str went from being a sequence of bytes to being a sequence of Unicode code points, which means that a character can no longer simply be treated as a byte. (Memory use changed as well: in Python 3.2 a text string takes 2 or 4 bytes per character depending on how the interpreter was built, and from Python 3.3 on it takes 1, 2 or 4 bytes per character depending on the widest character in the string. UTF-8 is an encoding for files and I/O, not the in-memory representation.)

You thus have a number of options that will depend upon your project:

Ignore binary files, filtering by file extension.
Read every file as text and either catch the decoding exception if and when it occurs and skip that file, or use one of the methods described in this thread: How can I detect if a file is binary (non-text) in python?

In this vein, you may edit your solution to simply catch the UnicodeDecodeError and skip the file.
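A minimal sketch of the catch-and-skip approach follows. The `read_text_files` name, its `encoding` parameter, and the `(texts, skipped)` return shape are all assumptions chosen for illustration, not part of the original code.

```python
import os


def read_text_files(path, encoding="utf-8"):
    """Return (texts, skipped): decoded contents keyed by path, and paths that failed."""
    texts = {}
    skipped = []
    for root, dirs, files in os.walk(path):
        for name in files:
            fullpath = os.path.join(root, name)
            try:
                with open(fullpath, "r", encoding=encoding) as f:
                    texts[fullpath] = f.read()
            except UnicodeDecodeError:
                # Not valid text in this encoding: treat it as binary and skip it
                skipped.append(fullpath)
    return texts, skipped
```

Keeping the skipped paths in a list makes it easy to check afterwards whether any file was rejected that should have been read.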

Regardless of your decision, note that if the files on your system use a wide range of character encodings, you will need to specify the encoding when opening each file; otherwise Python 3 assumes the platform's default encoding, which the traceback shows is UTF-8 on your system.
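As a small illustration of naming the encoding explicitly, the sketch below writes a hypothetical sample file in Latin-1 and reads it back twice. The file name and contents are invented for the example; `encoding` and `errors` are standard keyword arguments of the built-in `open()`.

```python
import os
import tempfile

# Hypothetical sample file: "café" written as Latin-1, where é is the single byte 0xE9
sample = os.path.join(tempfile.mkdtemp(), "latin1.txt")
with open(sample, "wb") as f:
    f.write("caf\u00e9".encode("latin-1"))

# Naming the right encoding decodes the file correctly
with open(sample, "r", encoding="latin-1") as f:
    text = f.read()

# With the wrong encoding, errors="replace" substitutes U+FFFD instead of raising
with open(sample, "r", encoding="utf-8", errors="replace") as f:
    lossy = f.read()
```

Using `errors="replace"` loses information, so it only suits cases where an approximate reading of the text is acceptable.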

As a reference, a great presentation on Python 3 I/O: http://www.dabeaz.com/python3io/MasteringIO.pdf

Thanks for this link, and for your comments - these will be very useful in my learning process. So far, at least, all the files seem to be easily readable as binary. – DrLou Dec 29 '11 at 17:07
