
I'm currently working on a multi-threaded downloader using the PycURL module. I am downloading parts of a file and merging them afterwards.

The parts are downloaded separately by multiple threads and written to temporary files in binary mode, but when I merge them into a single file (they are merged in the correct order), the checksums do not match.

This only happens on Linux. The same script works flawlessly on Windows.

This is the code (the part of the script) that merges the files:

with open(filename, 'wb') as outfile:
    print('Merging temp files ...')
    for tmpfile in self.tempfile_arr:
        with open(tmpfile, 'rb') as infile:
            # Stream-copy each part into the output in fixed-size chunks
            shutil.copyfileobj(infile, outfile)
    print('Done!')

I tried the write() method as well, but it results in the same issue, and it would use a lot of memory for large files.
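For reference, this is the kind of fixed-size read/write loop that shutil.copyfileobj performs internally; a minimal sketch (the 64 KiB chunk size is an arbitrary choice, and filename / self.tempfile_arr are the same names as in the snippet above) that keeps memory use bounded for large files:

CHUNK = 64 * 1024  # arbitrary buffer size

with open(filename, 'wb') as outfile:
    for tmpfile in self.tempfile_arr:
        with open(tmpfile, 'rb') as infile:
            while True:
                chunk = infile.read(CHUNK)
                if not chunk:  # end of this part file
                    break
                outfile.write(chunk)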

If I manually cat the part files into a single file on Linux, the file's checksum matches, so the issue is with Python's merging of the files.

EDIT:
Here are the files and checksums (SHA-256) that I used to reproduce the issue; a sketch for recomputing them follows the list:

  • Original file
    • HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
  • File merged by the script
    • HASH: c3e5a0404da480f36d37b65053732abe6d19034f60c3004a908b88d459db7d87
  • File merged manually using cat
    • HASH: 158575ed12e705a624c3134ffe3138987c64d6a7298c5a81794ccf6866efd488
    • Command used:

      for i in /tmp/pycurl_*_{0..7}; do cat $i >> manually_merged.tar.gz; done

  • Part files - numbered at the end, from 0 through 7
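For completeness, a small sketch of how the hashes above can be recomputed in Python (hashlib is in the standard library; the path argument is whichever merged file you want to check):

import hashlib

def sha256sum(path, chunk_size=64 * 1024):
    # Hash the file in chunks so large archives don't need to fit in memory
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(sha256sum('manually_merged.tar.gz'))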

Xplore
  • I think your `open` mode is not right (`wb`). Based on https://stackoverflow.com/a/4388244/3727050 you need `ab` (or `r+b` and `seek`) – urban Dec 28 '19 at 16:48
  • You need to provide a [mre] including some example tempfiles. I think you should be able to reproduce the issue with some tempfiles of just a few bytes each. Hopefully buffer size is not part of the problem. Also binary mode is probably not important, so you could use plain text files. – wjandrea Dec 28 '19 at 17:05
  • FWIW I wasn't able to reproduce the problem with two very short text files on Linux unfortunately. – wjandrea Dec 28 '19 at 17:19
  • Actually pycurl requires binary mode to write data. – Xplore Dec 28 '19 at 18:20
  • OK, the files help but your code's still incomplete: `filename`, `self.tempfile_arr`, and `shutil` are undefined – wjandrea Dec 28 '19 at 19:08
  • It's not the entire script, it's the part that merges the files – Xplore Dec 28 '19 at 19:34
  • There are too many things that could go wrong here that your example can't rule out: incomplete downloads, `tempfile_arr` not in the order you claim it is, etc. – chepner Jan 13 '20 at 14:37
  • Why do you use `shutil.copyfileobj` instead of reading and writing (`outfile.write(infile.read())`)? – 576i Jan 13 '20 at 14:38
  • @chepner I do check the HTTP return code after downloading each part. As I have mentioned, the exact same script works flawlessly on Windows, but it corrupts the file on Linux. – Xplore Jan 13 '20 at 14:42
  • @576i -- the `write()` function uses a lot of memory for large files; although I have tried the `write()` function, I get the same issue – Xplore Jan 13 '20 at 14:44
  • @576i That's basically what `copyfileobj` does, only it uses a fixed-size buffer to avoid reading the entire source file into memory at once. It's just a loop of repeated `x = src.read(SIZE); dst.write(x)` calls. – chepner Jan 13 '20 at 14:56
  • Your two files appear to have the same contents, just in a different order. In other words, you *didn't* merge the pieces in the correct order. – jasonharper Jan 13 '20 at 15:00
  • @jasonharper yes, I checked thoroughly and indeed the script was putting the parts in a different order. But somehow it was working on Windows every time. – Xplore Jan 17 '20 at 03:08
  • I cannot extract file.txt from the provided automatically_merged.tar.gz without errors. Please re-upload. – Ente Feb 04 '20 at 21:24
  • @jasonharper thanks!! I solved it, the order was the problem (a sketch of an order-safe fix follows these comments) – Xplore Dec 07 '20 at 16:39
  • I go with @urban; I suspect the Windows/Linux version of curl or your library automatically converts line endings or byte order. So yes, pick a smaller file, and check whether the part files have the same checksums before they are merged or touched. In the worst case, you can always dump the binary to see what's happening, say with xxd or a hex editor. – Jack Wu Jun 01 '21 at 13:22
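Following up on jasonharper's diagnosis, which the asker confirmed: a hypothetical sketch of building the list of part files in guaranteed numeric order, assuming they are named /tmp/pycurl_<id>_<n> as in the cat command above (glob returns paths in platform-dependent order, which could explain code working on one OS by accident):

import glob
import re

# Sort the parts by their trailing number rather than raw string order,
# so the platform's directory listing order does not matter and
# part 10 does not sort before part 2.
part_files = sorted(
    glob.glob('/tmp/pycurl_*_[0-9]*'),
    key=lambda p: int(re.search(r'_(\d+)$', p).group(1)),
)

The merge loop would then iterate over part_files instead of self.tempfile_arr.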

1 Answer


A minimally reproducible case would be convenient, but I'd suspect universal newlines to be the issue: by default, if your files are Windows-style text (newlines are \r\n), they're going to get translated to Unix-style newlines (\n) on reading. Then those Unix-style newlines get written back to the output file rather than the Windows-style ones you were expecting. That would explain the divergence between Python and cat (which does no translation whatsoever).

Try running your script passing newline='' (the empty string) to open.
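To see the translation this answer describes, a minimal demonstration (the file name is a throwaway; note the translation only happens in text mode):

# Write Windows-style line endings as raw bytes
with open('sample.txt', 'wb') as f:
    f.write(b'line1\r\nline2\r\n')

# Default text mode applies universal newlines: \r\n becomes \n on read
with open('sample.txt', 'r') as f:
    print(repr(f.read()))   # 'line1\nline2\n'

# newline='' disables the translation, preserving the original bytes
with open('sample.txt', 'r', newline='') as f:
    print(repr(f.read()))   # 'line1\r\nline2\r\n'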

Masklinn