4

Is there a memory-efficient way to concatenate gzipped files, using Python, on Windows, without decompressing them?

According to a comment on this answer, it should be as simple as:

cat file1.gz file2.gz file3.gz > allfiles.gz

but how do I do this with Python, on Windows?

BioGeek

4 Answers

8

Just keep writing to the same file.

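import shutil

# '...' stands for the output path; filenames is the list of .gz files to join.
with open(..., 'wb') as wfp:
  for fn in filenames:
    with open(fn, 'rb') as rfp:
      shutil.copyfileobj(rfp, wfp)  # copies in chunks, so memory use stays small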
Ignacio Vazquez-Abrams
1

You don't need Python to concatenate several files into one. The standard Windows "copy" command can do it:

copy file1.gz /b + file2.gz /b + file3.gz /b allfiles.gz

Or, simply:

copy *.gz /b allfiles.gz

But, if you wish to use Python, Ignacio's answer is a better option.
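For reference, a rough Python counterpart of the wildcard form might look like the sketch below. The function name `concat_matching` and its parameters are made up for illustration, and it assumes the files should be joined in sorted name order. It also skips the destination file, which would otherwise match a pattern such as *.gz on a second run.

```python
import shutil
from pathlib import Path


def concat_matching(directory, pattern, dest_name):
    """Append every file in `directory` matching `pattern` to `dest_name`, in name order."""
    directory = Path(directory)
    dest = directory / dest_name
    # Collect the sources before the destination is opened, and leave the
    # destination itself out of the list.
    sources = sorted(p for p in directory.glob(pattern) if p != dest)
    with open(dest, 'wb') as wfp:
        for src in sources:
            with open(src, 'rb') as rfp:
                shutil.copyfileobj(rfp, wfp)  # streams the bytes in chunks
    return sources


# Hypothetical usage, run from the folder that holds the .gz files:
# concat_matching('.', '*.gz', 'allfiles.gz')
```

Returning the list of sources makes it easy to log or double-check the order in which the files were joined.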

Erik A. Brandstadmoen
  • You forgot the `+`s and the `/b`s. – Ignacio Vazquez-Abrams Aug 13 '13 at 12:28
  • You have an extraneous `+` before `allfiles.gz` that will cause `file1.gz` to get overwritten. "If you omit Destination, the files are combined and stored under the name of the first file in the list." [source](https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/copy) – Stefan Jan 30 '20 at 16:20
  • You are right. Thanks for the attention to detail. That said, the meaning of /b before and after the file names is really difficult to understand. – Erik A. Brandstadmoen Jan 31 '20 at 16:38
1

If

cat file1.gz file2.gz file3.gz > allfiles.gz

works, then this should work too:

fileList = ['file1.gz', 'file2.gz', 'file3.gz']
destFilename = 'allfiles.gz'

# Bytes read per iteration. Memory use is bounded by this value, whatever the file sizes.
bufferSize = 64 * 1024

with open(destFilename, 'wb') as destFile:
    for fileName in fileList:
        with open(fileName, 'rb') as sourceFile:
            while True:
                chunk = sourceFile.read(bufferSize)
                if not chunk:  # an empty read means end of file
                    break
                destFile.write(chunk)
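One way to check that a byte-level concatenation like this produced a valid gzip file is to read it back with the standard gzip module, which handles files made of several gzip members. The sketch below is only an illustration: the helper name `same_content` and the chunk size are assumptions, and it compares the data in chunks so the check itself stays memory-efficient.

```python
import gzip


def same_content(source_paths, combined_path, chunk_size=64 * 1024):
    """Return True if `combined_path` decompresses to the sources' contents, in order."""
    with gzip.open(combined_path, 'rb') as combined:
        for path in source_paths:
            with gzip.open(path, 'rb') as src:
                while True:
                    expected = src.read(chunk_size)
                    if not expected:  # end of this source file
                        break
                    if combined.read(len(expected)) != expected:
                        return False
        # Nothing should be left over once every source has been matched.
        return combined.read(1) == b''


# Hypothetical usage with the names from the snippet above:
# same_content(fileList, destFilename)
```

Unlike the concatenation, this check does decompress the data, so it is meant as a one-off sanity test rather than part of the copy itself.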
Brionius
0

Gzipped files can be concatenated directly with the cat command, but there doesn't seem to be an obvious Python function for this (not in the standard library's gzip module, anyway). I only looked briefly, though, so there may be libraries that do it.

Nonetheless, one way to do this with only the standard library is to call cat through subprocess (this needs a cat executable on the PATH, which a stock Windows installation does not include):

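from subprocess import check_call
# '>' is shell syntax and cannot be passed as an argument to cat,
# so cat's output is sent to the file through stdout instead.
with open(output_name, 'wb') as out:
    check_call(['cat', file1_path, file2_path], stdout=out)  # check_call takes a list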

To generalize this to arbitrary numbers of inputs, you can do:

inputs = ['input1', 'input2', ... 'input9001']
output_name = 'output.gz'

# Pass the file names as separate list items and redirect stdout to the
# output file; a '>' inside the argument list would be treated as a file name.
with open(output_name, 'wb') as out:
    check_call(['cat'] + inputs, stdout=out)

I hope that is helpful to someone.

PaulG