
I'm currently working on a multi-threaded downloader using the PycURL module. I am downloading parts of a file and merging them afterwards.

The parts are downloaded separately by multiple threads and written to temporary files in binary mode, but when I merge them into a single file (they are merged in the correct order), the checksums do not match.

This only happens on Linux. The same script works flawlessly on Windows.

This is the code (the part of the script) that merges the files:

with open(filename, 'wb') as outfile:
    print('Merging temp files ...')
    for tmpfile in self.tempfile_arr:
        with open(tmpfile, 'rb') as infile:
            # Stream-copy each part into the output in fixed-size chunks
            shutil.copyfileobj(infile, outfile)
    print('Done!')

I tried the write() method as well, but it results in the same issue, and it would use a lot of memory for large files.
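For reference, this is the kind of fixed-size read/write loop that shutil.copyfileobj performs internally; a minimal sketch (the 64 KiB chunk size is an arbitrary choice, and filename / self.tempfile_arr are the same names as in the snippet above) that keeps memory use bounded for large files:

CHUNK = 64 * 1024  # arbitrary buffer size

with open(filename, 'wb') as outfile:
    for tmpfile in self.tempfile_arr:
        with open(tmpfile, 'rb') as infile:
            while True:
                chunk = infile.read(CHUNK)
                if not chunk:  # end of this part file
                    break
                outfile.write(chunk)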

If I manually cat the part files into a single file on Linux, the file's checksum matches, so the issue is with Python's merging of the files.

EDIT:
Here are the files and checksums (SHA-256) that I used to reproduce the issue; a sketch for recomputing them follows the list:

  • Original file
    • HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
  • File merged by the script
    • HASH: c3e5a0404da480f36d37b65053732abe6d19034f60c3004a908b88d459db7d87
  • File merged manually using cat
    • HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
    • Command used:

      for i in /tmp/pycurl_*_{0..7}; do cat $i >> manually_merged.tar.gz; done

  • Part files - numbered at the end, from 0 through 7
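For completeness, a small sketch of how the hashes above can be recomputed in Python (hashlib is in the standard library; the path argument is whichever merged file you want to check):

import hashlib

def sha256sum(path, chunk_size=64 * 1024):
    # Hash the file in chunks so large archives don't need to fit in memory
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(sha256sum('manually_merged.tar.gz'))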

Xplore
  • I think your `open` mode is not right (`wb`). Based on https://stackoverflow.com/a/4388244/3727050 you need `ab` (or `r+b` and `seek`) – urban Dec 28 '19 at 16:48
  • You need to provide a [mre] including some example tempfiles. I think you should be able to reproduce the issue with some tempfiles of just a few bytes each. Hopefully buffer size is not part of the problem. Also binary mode is probably not important, so you could use plain text files. – wjandrea Dec 28 '19 at 17:05
  • FWIW I wasn't able to reproduce the problem with two very short text files on Linux unfortunately. – wjandrea Dec 28 '19 at 17:19
  • Actually pycurl requires binary mode to write data. – Xplore Dec 28 '19 at 18:20
  • OK, the files help but your code's still incomplete: `filename`, `self.tempfile_arr`, and `shutil` are undefined – wjandrea Dec 28 '19 at 19:08
  • It's not the entire script, it's the part that merges the files – Xplore Dec 28 '19 at 19:34
  • There are too many things that could go wrong here that your example can't rule out: incomplete downloads, `tempfile_arr` not in the order you claim it is, etc. – chepner Jan 13 '20 at 14:37
  • Why do you use `shutil.copyfileobj` instead of reading and writing (`outfile.write(infile.read())`)? – 576i Jan 13 '20 at 14:38
  • @chepner I do check the HTTP return code after downloading each part. As I have mentioned, the exact same script works flawlessly on Windows, but it corrupts the file on Linux. – Xplore Jan 13 '20 at 14:42
  • @576i -- the `write()` function uses a lot of memory for large files; although I have tried the `write()` function, I get the same issue – Xplore Jan 13 '20 at 14:44
  • @576i That's basically what `copyfileobj` does, only it uses a fixed-size buffer to avoid reading the entire source file into memory at once. It's just a loop of repeated `x = src.read(SIZE); dst.write(x)` calls. – chepner Jan 13 '20 at 14:56
  • Your two files appear to have the same contents, just in a different order. In other words, you *didn't* merge the pieces in the correct order. – jasonharper Jan 13 '20 at 15:00
  • @jasonharper yes, I checked thoroughly and indeed the script was putting the parts in a different order. But somehow it was working on Windows every time. – Xplore Jan 17 '20 at 03:08
  • I cannot extract file.txt from the provided automatically_merged.tar.gz without errors. Please re-upload. – Ente Feb 04 '20 at 21:24
  • @jasonharper thanks!! I solved it, the order was the problem (a sketch of an order-safe fix follows these comments) – Xplore Dec 07 '20 at 16:39
  • I go with @urban; I suspect the Windows/Linux version of curl or your library automatically converts line endings or byte order. So yes, pick a smaller file, and check whether the part files have the same checksums before they are merged or touched. In the worst case, you can always dump the binary to see what's happening, say with xxd or a hex editor. – Jack Wu Jun 01 '21 at 13:22
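Following up on jasonharper's diagnosis, which the asker confirmed: a hypothetical sketch of building the list of part files in guaranteed numeric order, assuming they are named /tmp/pycurl_<id>_<n> as in the cat command above (glob returns paths in platform-dependent order, which could explain code working on one OS by accident):

import glob
import re

# Sort the parts by their trailing number rather than raw string order,
# so the platform's directory listing order does not matter and
# part 10 does not sort before part 2.
part_files = sorted(
    glob.glob('/tmp/pycurl_*_[0-9]*'),
    key=lambda p: int(re.search(r'_(\d+)$', p).group(1)),
)

The merge loop would then iterate over part_files instead of self.tempfile_arr.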

1 Answer


A minimally reproducible case would be convenient, but I'd suspect universal newlines to be the issue: by default, if your files are Windows-style text (newlines are \r\n), they're going to get translated to Unix-style newlines (\n) on reading. Then those Unix-style newlines get written back to the output file rather than the Windows-style ones you were expecting. That would explain the divergence between Python and cat (which does no translation whatsoever).

Try running your script passing newline='' (the empty string) to open.
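To see the translation this answer describes, a minimal demonstration (the file name is a throwaway; note the translation only happens in text mode):

# Write Windows-style line endings as raw bytes
with open('sample.txt', 'wb') as f:
    f.write(b'line1\r\nline2\r\n')

# Default text mode applies universal newlines: \r\n becomes \n on read
with open('sample.txt', 'r') as f:
    print(repr(f.read()))   # 'line1\nline2\n'

# newline='' disables the translation, preserving the original bytes
with open('sample.txt', 'r', newline='') as f:
    print(repr(f.read()))   # 'line1\r\nline2\r\n'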

Masklinn