5

Using Python 3, I want to os.walk a directory of files, read each one into a binary object (a string?) and do some further processing on them. First step, though: how do I read the files that os.walk returns?

# NOTE: Execute with python3.2.2

import os
import sys

path = "/home/user/my-files"

count = 0
successcount = 0
errorcount = 0
i = 0

#for directory in dirs
for (root, dirs, files) in os.walk(path):
 # print (path)
 print (dirs)
 #print (files)

 for file in files:

   base, ext = os.path.splitext(file)
   fullpath = os.path.join(root, file)

   # Read the file into binary? --------
   input = open(fullpath, "r")
   content = input.read()
   length = len(content)
   count += 1
   print ("    file: ---->",base," / ",ext," [count:",count,"]",  "[length:",length,"]")
   print ("fullpath: ---->",fullpath)

ERROR:

Traceback (most recent call last):
  File "myFileReader.py", line 41, in <module>
    content = input.read()
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 11: invalid continuation byte
tshepang
DrLou

2 Answers

9

To read a binary file you must open the file in binary mode. Change

input = open(fullpath, "r")

to

input = open(fullpath, "rb")

The result of the read() will be a bytes() object.
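As a minimal sketch of how this fits into the os.walk loop from the question (the helper name read_files_as_bytes is made up for illustration, it is not from the original post):

```python
import os

def read_files_as_bytes(top):
    """Yield (fullpath, content) for every file under top; content is bytes."""
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            # "rb" skips text decoding entirely, so read() returns bytes
            # and the with-block closes the file when done.
            with open(fullpath, "rb") as handle:
                yield fullpath, handle.read()
```

Iterating with for fullpath, content in read_files_as_bytes("/home/user/my-files") then gives len(content) as the file size in bytes, with no UnicodeDecodeError possible.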

Lennart Regebro
  • Tks, Lennart - Yes, this was the secret sauce I needed. Kinda new to Python3! – DrLou Dec 29 '11 at 17:03
  • It's not actually that Python 3 specific. Binary files should be opened with the 'b' flag in Python 2 as well. – Lennart Regebro Dec 29 '11 at 20:20
  • Yeah, it all seems kinda dumb to me in retrospect - but this is how we idiots learn! You're probably thinking: RTFM! Thanks again for help. – DrLou Nov 03 '14 at 21:17
3

As some of your files are binary, they cannot be decoded into the Unicode characters that Python 3 uses for all strings. Note that a large change between Python 2 and Python 3 is that str went from being a sequence of bytes to a sequence of Unicode characters, which means that a character can no longer simply be treated as a byte (yes, text strings can take more memory: in Python 3.2 a str is stored with 2 or 4 bytes per character depending on how the interpreter was built, and since Python 3.3 with 1, 2, or 4 bytes per character depending on the string's contents).
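A small illustration of the character/byte distinction (the sample word is arbitrary, chosen only because it contains one non-ASCII character):

```python
text = "naïve"                  # str: a sequence of Unicode characters
data = text.encode("utf-8")     # bytes: the UTF-8 encoding of that text

print(len(text))                # 5 characters
print(len(data))                # 6 bytes, because "ï" takes two bytes in UTF-8
print(data.decode("utf-8") == text)  # True: decoding reverses encoding
```

Arbitrary binary data generally has no such valid decoding, which is why reading it through a text-mode file fails.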

You thus have a number of options that will depend upon your project:

In this vein, you may edit your solution to simply catch the UnicodeDecodeError and skip the file.
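A rough sketch of that skip-on-error approach, assuming UTF-8 is the encoding you expect for the text files (the function name read_text_files and the returned pair are illustrative choices, not part of the original code):

```python
import os

def read_text_files(top, encoding="utf-8"):
    """Return (texts, skipped): decoded file contents and undecodable paths."""
    texts = {}
    skipped = []
    for root, dirs, files in os.walk(top):
        for name in files:
            fullpath = os.path.join(root, name)
            try:
                with open(fullpath, "r", encoding=encoding) as handle:
                    texts[fullpath] = handle.read()
            except UnicodeDecodeError:
                # Not valid text in this encoding (probably binary): skip it.
                skipped.append(fullpath)
    return texts, skipped
```

Keeping the skipped paths around makes it easy to check afterwards whether anything you cared about was dropped.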

Regardless of your decision, it is important to note that if there is a wide range of different character encodings in the files on your system, you will need to specify the encoding explicitly. If you do not, open() in text mode falls back to the platform's locale default (see locale.getpreferredencoding()), which is UTF-8 on your system, as the traceback shows.
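For example, a sketch of passing the encoding (and optionally an error handler) to open(); the wrapper read_text is hypothetical, while the encoding and errors keyword arguments are standard parameters of the built-in open():

```python
def read_text(path, encoding="utf-8", errors="strict"):
    """Read a text file with an explicit encoding and error policy."""
    # errors="strict" raises UnicodeDecodeError on bad bytes;
    # errors="replace" substitutes U+FFFD for them instead.
    with open(path, "r", encoding=encoding, errors=errors) as handle:
        return handle.read()
```

A Latin-1 file could then be read with read_text(path, encoding="latin-1"), and errors="replace" is an option when losing a few undecodable characters is acceptable.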

As a reference, a great presentation on Python 3 I/O: http://www.dabeaz.com/python3io/MasteringIO.pdf

Cory Dolphin
  • Thanks for this link, and for your comments - these will be very useful in my learning process. So far, at least, all the files seem to be easily readable as binary. – DrLou Dec 29 '11 at 17:07