
I have a bunch of compressed files (.gz) and I want to merge them into a single one.

I am aware of the command-line tool:

zcat file1.gz file2.gz > file.gz

and I also found this solution on Stack Overflow:

import shutil

# Copy the raw bytes of each input file into the output file,
# without decompressing or recompressing anything.
with open(..., 'wb') as wfp:
    for fn in filenames:
        with open(fn, 'rb') as rfp:
            shutil.copyfileobj(rfp, wfp)

However, is there any other way of doing this in Python that is as efficient as cat?

Andrea Grioni
    Um, your `zcat` command creates a single *decompressed* file as output. – Charles Duffy Oct 31 '19 at 15:29
  • Do any of the 3 other answers in that linked question not suffice? – Sayse Oct 31 '19 at 15:29
  • If you want a single *compressed* file as output, just `cat` them with no decompression or recompression happening at all; a compliant gzip decoder treats multiple concatenated compressed streams the same as it would a single stream (this is how "rsyncable" gzip works, resetting the compression table so blocks don't depend on the content of prior blocks); see the first sketch after these comments. – Charles Duffy Oct 31 '19 at 15:30
    (your native-Python solution given in the question already does exactly that, concatenating the three compressed streams into one compressed stream with no decompression or recompression taking place anywhere, making it very much *unlike* the `zcat` suggested above it). – Charles Duffy Oct 31 '19 at 15:31
  • ...so it sounds to me like your real question is how to get more efficient file concatenation in Python, with the compression just being a confusing/confounding factor. – Charles Duffy Oct 31 '19 at 15:35
  • Yes, I would like to know if there is a more efficient way to concatenate those files with Python rather than opening and looping through them. If not, I will use cat. – Andrea Grioni Oct 31 '19 at 15:43
  • Generally speaking, Python is the wrong tool for the job when "faster than C" is a requirement. Especially when it's highly-optimized C like GNU's standard library and coreutils. If you're seeing `cat` be *much* faster than you expect, it may be fun/interesting to read the source. – Charles Duffy Oct 31 '19 at 15:50
  • ...there are some interesting tricks one can play with modern Linux kernels, telling the kernel to do all the work of copying between file descriptors without getting userland involved at all. Couldn't say if GNU `cat` uses them without looking, but it's believable. And if it does, then you can start by looking into whether anyone's written Python code for accessing those same capabilities (see the second sketch after these comments). – Charles Duffy Oct 31 '19 at 15:54
  • (There are other tricks that can be played as well that are dependent on filesystem selection and on alignment; if you're on btrfs with the right optional flags, one can tell the filesystem to reference the same blocks from different positions in two different files, meaning there's no need to actually create an extra copy of the same data on disk *at all* to create a second file that appears to contain it). – Charles Duffy Oct 31 '19 at 15:56
  • Well, I do not want to get too technical at this moment, but I much appreciate your explanations. Your answers opened up many interesting topics that I will consider for new projects (e.g. picking a different programming language when speed matters). – Andrea Grioni Oct 31 '19 at 16:08
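
To illustrate the point made in the comments, that concatenated gzip members form one valid compressed stream, here is a minimal self-contained check; the payloads and the merged.gz filename are made up for illustration:

import gzip

# Build a single .gz file by plain byte concatenation of two
# independently compressed members -- no recompression involved.
with open('merged.gz', 'wb') as out:
    for chunk in (b'hello ', b'world\n'):
        out.write(gzip.compress(chunk))

# gzip readers (gunzip, zcat, Python's gzip module) read all members
# in sequence, so the file decompresses to the concatenated payloads.
with gzip.open('merged.gz', 'rb') as f:
    assert f.read() == b'hello world\n'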
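
As for the kernel-side copying mentioned in the later comments: since Python 3.3, os.sendfile exposes the sendfile(2) syscall, which moves bytes between file descriptors without passing them through a userspace buffer (Linux supports a regular file as the destination since kernel 2.6.33). A minimal sketch, assuming Linux and a list of input filenames:

import os

def cat_files(filenames, outpath):
    # Ask the kernel to copy the bytes between descriptors directly,
    # avoiding read()/write() round trips through userspace.
    with open(outpath, 'wb') as out:
        for fn in filenames:
            with open(fn, 'rb') as src:
                remaining = os.fstat(src.fileno()).st_size
                offset = 0
                while remaining > 0:
                    sent = os.sendfile(out.fileno(), src.fileno(), offset, remaining)
                    if sent == 0:  # defensive guard against a stall
                        break
                    offset += sent
                    remaining -= sent

For what it's worth, the shutil.copyfileobj loop in the question still moves the data through a userspace buffer; passing a larger buffer size (e.g. shutil.copyfileobj(rfp, wfp, 1024 * 1024)) is the portable way to cut down on the number of syscalls.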

0 Answers