opening and writing a large binary file python

Question

I have a homebrew web based file system that allows users to download their files as zips; however, I found an issue while dev'ing on my local box not present on the production system.

In linux this is a non-issue (the local dev box is a windows system).

I have the following code

algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
temp_file = open(temp_file_path, 'wb+')

data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    temp_file.write(dec_data)
    data = file.read(settings.READ_SIZE)

# Takes a dump right here!
# error in cipher operation (wrong final block length)
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()

The above code opens a file, and (using the key for the current file share) decrypts the file and writes it to a temporary location (that will later be stuffed into a zip file).

My issue is on the file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb') line. Since windows cares a metric ton about binary files if I don't specify 'rb' the file will not read to end on the data read loop; however, for some reason since I am also writing to temp_file it never completely reads to the end of the file...UNLESS i add a + after the b 'rb+'.

if i change the code to file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb+') everything works as desired and the code successfully scrapes the entire binary file and decrypts it. If I do not add the plus it fails and cannot read the entire file...

Another section of the code (for downloading individual files) reads (and works flawlessly no matter the OS):

algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')

filename = smart_str(cur_file.name, errors='replace')
response = HttpResponse(mimetype='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'

data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    response.write(dec_data)
    data = file.read(settings.READ_SIZE)

# no dumps to be taken when finishing up the decrypt process... 
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()

Clarification

The cipher error is likely because the file was not read in its entirety. For example, I have a 500MB file I am reading in at 64*1024 bytes at a time. I read until I receive no more bytes, when I don't specify b in windows it cycles through the loop twice and returns some crappy data (because python thinks it is interacting with a string file not a binary file).

When I specify b it takes 10-15 seconds to completely read in the file, but it does it succesfully, and the code completes normally.

When I am concurrently writing to another file as i read in from the source file (as in the first example) if I do not specify rb+ it displays the same behavior as not even specifying b which is, that it only reads a couple segments from the file before closing the handle and moving on, i end up with an incomplete file and the decryption fails.

What does "takes a dump here", "will not read to end", etc., mean? What exactly happens? — abarnert, Sep 06 '13 at 19:27
@abarnert clarification added. - to note, I know that adding `+` seems to fix this, but I want to know why it does on windows and what is going on when it opens the file handle. The python docs in this area are quite weak. — Mike McMahon, Sep 06 '13 at 19:34
In the other section of code there's no `decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])`, so is it still using the `decrypt` from the first section? — martineau, Sep 06 '13 at 19:41
@mart it's creating another decrypt object, but that's unimportant to the question at hand. The cipher issue is a result of python not properly reading the contents of a file unless i specify mode `'rb+'` where `'rb'` should suffice... — Mike McMahon, Sep 06 '13 at 19:43
@MikeMcMahon: I'm confused by your clarification. You say that without the `b` it cycles through the loop twice, and then you say that with `b` but not `+` it does the same thing as without `b`, and then you say that it reads a couple segments and then closes the handle. That's contradictory. And neither one makes any sense. The only difference between `r` and `rb` is that `r` will translate line endings (so each pair of bytes that matches 0D 0A will be replaced by a single byte 0A). — abarnert, Sep 06 '13 at 20:02
I'm not able to reproduce the issue on my Windows 7 system. When I copy a 1GB file (opened with mode `'rb'`) to another file (opened with mode `'w+b'`) it works as expected. removing the `+` from the output file didn't change anything, nor did adding it to the input file. Can you try to gather some more info about what's happening when the file isn't getting read? Is `file` still open (check `file.closed`)? Can you do another `read` and get actual data from further into the file? — Blckknght, Sep 06 '13 at 20:34
hey @abarnert - the issue is as follows if the line reads just 'rb' it loops twice, the first time to `read()` returns a few erroneous bytes, the second call to read (within the loop) returns more bytes, and then finally the third call returns nothing. The entire file is not read. when it reads `rb+` it continuously reads without issue. This is on a win7 vm, normally it runs on the latest LTS of ubuntu. — Mike McMahon, Sep 06 '13 at 22:12
Is it possible that some other program is opening and modifying the file on you, or replacing it? — abarnert, Sep 06 '13 at 22:19
Well on my local dev system i don't believe so. This is a vmbox dedicated to running this software. It's powered by django, but it seems strange that `'rb+'` and `'rb'` would make such a difference (when it doesn't in any other area of the code). if it were an intermittent handle being generated by something third party i probably would not see my results changing between the addition of the `+` — Mike McMahon, Sep 06 '13 at 22:28
@MikeMcMahon: `rb+` would _definitely_ make a difference if the file were being overwritten on you by some other program—it would make the other program fail instead of making this program read garbage. I've written an answer to explain in more detail. — abarnert, Sep 06 '13 at 22:32

abarnert · Accepted Answer · 2013-09-06T22:31:17.243

I'm going to take a guess here:

You have some other program that's continually replacing the files you're trying to read.

On linux, this other program works by atomically replacing the file (that is, writing to a temporary file, then moving the temporary file to the path). So, when you open a file, you get the version from 8 seconds ago. A few seconds later, someone comes along and unlinks it from the directory, but that doesn't affect your file handle in any way, so you can read the entire file at your leisure.

On Windows, there is no such thing as atomic replacement. There are a variety of ways to work around that problem, but what many people do is to just rewrite the file in-place. So, when you open a file, you get the version from 8 seconds ago, start reading it… and then suddenly someone else blanks the file to rewrite it. That does affect your file handle, because they've rewritten the same file. So you hit an EOF.

Opening the file in r+ mode doesn't do anything to solve the problem, but it adds a new problem that hides it: You're opening the file with sharing settings that prevent the other program from rewriting the file. So, now the other program is failing, meaning nobody is interfering with this one, meaning this one appears to work.

In fact, it could be even more subtle and annoying than this. Later versions of Windows try to be smart. If I try to open a file while someone else has it locked, instead of failing immediately, it may wait a short time and try again. The rules for exactly how this works depend on the sharing and access you need, and aren't really documented anywhere. And effectively, whenever it works the way you want, it means you're relying on a race condition. That's fine for interactive stuff like dragging a file from Explorer to Notepad (better to succeed 99% of the time instead of 10% of the time), but obviously not acceptable for code that's trying to work reliably (where succeeding 99% of the time just means the problem is harder to debug). So it could easily work differently between r and r+ modes for reasons you will never be able to completely figure out, and wouldn't want to rely on if you could…

Anyway, if any variation of this is your problem, you need to fix that other program, the one that rewrites the file, or possibly both programs in cooperation, to properly simulate atomic file replacement on Windows. There's nothing you can do from just this program to solve it.*

* Well, you could do things like optimistic check-read-check and start over whenever the modtime changes unexpectedly, or use the filesystem notification APIs, or… But it would be much more complicated than fixing it in the right place.

i'll investigate with the process explorer tools, but i cannot find anything that is altering the files (or continuously writing to them).. — Mike McMahon, Sep 06 '13 at 22:38
while i can't find a race condition you did explain the difference between `r+b` and `rb` best and why it __would__ cause a difference i how the file was handled (locking it vs sharing). This is mostly a trivial issue at this point since I can't seem to find what may be touching it (as nothing should) and because we run this primarily in a *nix environment where (as you said) it does support atomic file operations. — Mike McMahon, Sep 06 '13 at 22:52
@MikeMcMahon: If it's on a local drive, [`FindFirstChangeNotification`](http://msdn.microsoft.com/en-us/library/windows/desktop/aa365261(v=vs.85).aspx) is the best way to find intermittent accesses to the file, and there are tools that wrap that up nicely (I think sysinternals has a command-line tool, and it's in one of their GUI tools). If it's on an SMB share, that's a whole other ball of wax, but hopefully you don't have to deal with that. — abarnert, Sep 06 '13 at 22:57

opening and writing a large binary file python

Clarification

1 Answers1