I am trying to read a file that contains an ASCII header and binary data sections, but the Python interpreter appears to close the file prematurely (i.e. before the end of file is reached). Here is my code, developed in Python 2.7.12:

import os
import sys

fileSize = os.path.getsize(filename) # file size in bytes
bytesRead = 0L
content = []
with open(filename,'r') as f:
    content = f.read()
    bytesRead += sys.getsizeof(content)

print 'File size:',fileSize
print 'Total read:',bytesRead

However, the file is closed prematurely after around 1 MB of the file's 77 MB total has been read. The two print statements above produce:

File size: 76658457, Total read: 1165436

It exits within one of the binary sections. I modified the original program to repeatedly re-open the file from the point at which it was closed, as follows:

fileSize = os.path.getsize(filename) # file size in bytes
bytesRead = 0L
content = []
count = 0 # number of times the file has been re-opened

try:
    while True:
        count += 1
        with open(filename,'r') as f:
            f.seek(bytesRead+1)
            newContent = f.read()
            content.append(newContent)
            bytesRead += sys.getsizeof(newContent)
            print count,' Total read:',bytesRead
except Exception,e:
    print e

print 'File size:',fileSize
print '% read = ',bytesRead*100./float(fileSize)
print 'count: ',count

This gave:

1 Total read: 1165436
2 Total read: 1180218
3 Total read: 1181902

... [many more iterations] ...

25564 Total read: 77925641
25567 Total read: 77926615
25568 Total read:Exception: I/O operation on closed file

File size: 76658457
% read =  101.65429721603
count:  25568

Any idea how I can persuade Python not to keep closing the file, and just read it all in one go?

SWS
  • This is an odd output. First, you don't even increment `count`, how can those `print` commands print `1`, `2`, `3` etc? Second, Python 2.7 docs explicitly state that, when `read()` is called without arguments, it will read the whole file (https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects). Are you sure the code you are showing is the one which is actually being executed? – lucasnadalutti Dec 02 '16 at 17:26
  • 3
    Some OSs, like Windows, handle reading text files differently than binary files. For example, in the way they handle newline characters. Windows text files can also have an EOF (end-of-file) character in them. It looks like your code is reading the file in the default text mode. Try opening the file in binary mode with `with open(filename,'rb') as f:` – martineau Dec 02 '16 at 17:57
  • 3
    You need to open the file in binary mode. When you don't include the `b` in the mode, Python will alter the content of the file, possibly inserting an EOF or newline. – Charles D Pantoga Dec 02 '16 at 18:03
  • 1
    Never use ``sys.getsizeof()``. In your case you want to use ``len()``. – Armin Rigo Dec 02 '16 at 19:33
  • Added the increment to count - otherwise code is the one that produced the output. I'll test it shortly, but I suspect the answer is to open the file in binary mode, as suggested by martineau. – SWS Dec 02 '16 at 20:08
  • Note that I'm comparing the total number of bytes on the file to the number of bytes that has been read. So I don't think len() would be correct. – SWS Dec 02 '16 at 20:10
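
To illustrate the point raised in the comments about `sys.getsizeof()` versus `len()`, here is a minimal sketch. The sample data is made up for illustration; the exact overhead reported by `sys.getsizeof()` depends on the interpreter and version, so only the relationship between the two numbers matters here.

```python
import sys

# Hypothetical sample data: 3000 bytes of payload.
data = b"abc" * 1000

payload_bytes = len(data)           # number of bytes held in the string
object_bytes = sys.getsizeof(data)  # memory footprint of the Python object

print("len():           %d" % payload_bytes)
print("sys.getsizeof(): %d" % object_bytes)
```

Because `sys.getsizeof()` includes the object's own bookkeeping overhead on every call, summing it over many small reads can overshoot the size of the file on disk, which would be consistent with the "101.65%" figure in the question.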

1 Answer

I believe the problem is that you are measuring the in-memory size of the object (with `sys.getsizeof()`) rather than the data actually read. Compare the last line of the content you read with the last line of the file; this will most likely show that the entire file is in fact being read.
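
One way to check this is sketched below, under two assumptions: that the file is opened in binary mode (`'rb'`), as the comments on the question suggest, and that `len()` is used to count the bytes read. The helper `read_all_bytes` and the temporary sample file are hypothetical stand-ins for the real 77 MB file; the sample deliberately contains the byte 0x1A, which some platforms treat as an end-of-file marker in text mode.

```python
import os
import tempfile

def read_all_bytes(path):
    """Read the whole file in binary mode so no byte is translated or treated as EOF."""
    with open(path, "rb") as f:
        return f.read()

# Hypothetical sample file: an ASCII header followed by binary data.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"HEADER\r\n" + b"\x00\x1a\xff" * 1000)

content = read_all_bytes(sample_path)
file_size = os.path.getsize(sample_path)  # size on disk in bytes

print("File size:  %d" % file_size)
print("Total read: %d" % len(content))

os.remove(sample_path)  # clean up the sample file
```

If the two numbers match for the real file, the whole file was read, and the earlier discrepancy came from how the read was opened or measured rather than from the file being closed early.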

tgs266