I'm currently working on a multi-threaded downloader using the PycURL module. I download parts of a file and merge them together afterwards.
The parts are downloaded separately from multiple threads and written to temporary files in binary mode, but when I merge them into a single file (in the correct order), the checksums do not match.
This only happens on Linux; the same script works flawlessly on Windows.
This is the code (part of the script) that merges the files:
with open(filename, 'wb') as outfile:
    print('Merging temp files ...')
    for tmpfile in self.tempfile_arr:
        with open(tmpfile, 'rb') as infile:
            shutil.copyfileobj(infile, outfile)
    print('Done!')
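For reference, here is a minimal, self-contained sketch of the same merge (the part-file names and contents are made up for the example): it builds a few fake part files, merges them with shutil.copyfileobj exactly as above, and checks the result against a plain in-memory byte concatenation, which is what cat effectively does. As far as I understand the API, the two should always hash identically:

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            h.update(chunk)
    return h.hexdigest()

# Create some fake part files (stand-ins for the pycurl temp files).
tmpdir = tempfile.mkdtemp()
parts = []
for i in range(8):
    p = os.path.join(tmpdir, 'part_%d' % i)
    with open(p, 'wb') as f:
        f.write(os.urandom(1024))
    parts.append(p)

# Merge them the same way the script does.
merged = os.path.join(tmpdir, 'merged.bin')
with open(merged, 'wb') as outfile:
    for tmpfile in parts:
        with open(tmpfile, 'rb') as infile:
            shutil.copyfileobj(infile, outfile)

# Reference: concatenate the raw bytes directly (what `cat` does).
reference = b''.join(open(p, 'rb').read() for p in parts)
assert hashlib.sha256(reference).hexdigest() == sha256_of(merged)
```

If this assertion holds on the affected machine, the merge logic itself is sound and the mismatch comes from the part files at merge time.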
I tried the write() method as well, but it results in the same issue, and it would use a lot of memory for large files.
If I manually cat the part files into a single file on Linux, the file's checksum matches, so the issue is with Python's merging of the files.
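To narrow down where the two outputs diverge, a small helper like the one below (the function name is my own) can report the first byte offset at which the script-merged file differs from the cat-merged one; comparing that offset against the part boundaries would show which part, or which join, is affected:

```python
import tempfile

def first_difference(path_a, path_b, chunk_size=1 << 16):
    """Return the byte offset where two files first differ, or None if identical."""
    offset = 0
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        while True:
            ca = a.read(chunk_size)
            cb = b.read(chunk_size)
            if ca != cb:
                # Scan inside the chunk for the exact position.
                for i in range(max(len(ca), len(cb))):
                    if i >= len(ca) or i >= len(cb) or ca[i] != cb[i]:
                        return offset + i
            if not ca:  # both files at EOF (ca == cb here)
                return None
            offset += len(ca)

# Demo on two small throwaway files differing at offset 4.
with tempfile.NamedTemporaryFile(delete=False) as f1, \
     tempfile.NamedTemporaryFile(delete=False) as f2:
    f1.write(b'abcdefgh')
    f2.write(b'abcdXfgh')
print(first_difference(f1.name, f2.name))  # 4
```

Running it on merged_by_script vs manually_merged.tar.gz (paths as on the affected machine) would tell whether the corruption is localized to one part or spread throughout.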
EDIT:
Here are the files and checksums(sha256) that I used to reproduce the issue:
- Original file
- HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
- file merged by script
- HASH: c3e5a0404da480f36d37b65053732abe6d19034f60c3004a908b88d459db7d87
- file merged manually using cat
- HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
Command used:
for i in /tmp/pycurl_*_{0..7}; do cat "$i" >> manually_merged.tar.gz; done
The part files are numbered at the end, from 0 through 7.
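One thing worth ruling out (not necessarily the cause here, since there are only 8 parts): if self.tempfile_arr is ever built by sorting filenames as strings, a lexicographic sort would place part 10 before part 2 once the part count passes ten. A numeric sort on the trailing index avoids that; the paths below are hypothetical:

```python
import re

def numeric_part_order(paths):
    """Sort part-file paths by their trailing integer index, not lexicographically."""
    return sorted(paths, key=lambda p: int(re.search(r'(\d+)$', p).group(1)))

parts = ['/tmp/pycurl_abc_10', '/tmp/pycurl_abc_2', '/tmp/pycurl_abc_0']
print(numeric_part_order(parts))
# ['/tmp/pycurl_abc_0', '/tmp/pycurl_abc_2', '/tmp/pycurl_abc_10']
```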