
I need to figure out how to write program output to a compressed file in Python, similar to the Perl two-liner below:

open ZIPPED, "| gzip -c > zipped.gz";
print ZIPPED "Hello world\n";

In Perl, this pipes whatever you print to the ZIPPED filehandle through Unix gzip, writing the compressed output to the file "zipped.gz".

I know I can use the gzip module to do this in Python:

import gzip
zipped = gzip.open("zipped.gz", 'wb')
zipped.write("Hello world\n")

However, that is extremely slow. According to the profiler, that method accounts for 90% of my run time, since I am writing 200 GB of uncompressed data to various output files. I am aware that the file system could be part of the problem here, but I want to rule it out by using Unix/Linux compression instead. This is partly because I have heard that decompression with this same module is slow as well.

aganders3
bu11d0zer
  • Do you need it done in pure Python, or could you settle for a call into a binary on your filesystem (in Python, you'd use the subprocess module)? – ChristopheD Nov 28 '11 at 22:25
  • I prefer not to do it in Python since pure Python methods are too slow. – bu11d0zer Nov 28 '11 at 23:16
  • Have you run the gzip program from the shell on your 200GB of uncompressed data? I would expect that to take quite a bit of wallclock time at 90-100% CPU utilization - on my Windows box it runs about 1 minute per GB, whereas the Python gzip module takes about 2 minutes per GB. – Dave Nov 28 '11 at 23:37
  • Dave, yeah it's the difference between the 2 minutes and 1 minute that I am going after. – bu11d0zer Nov 28 '11 at 23:46

5 Answers


ChristopheD's suggestion of using the subprocess module is an appropriate answer to this question. However, it's not clear to me that it will solve your performance problems. You would have to measure the performance of the new code to be sure.

To convert your sample code:

import subprocess

p = subprocess.Popen("gzip -c > zipped.gz", shell=True, stdin=subprocess.PIPE)
p.communicate("Hello World\n")

Since you need to send large amounts of data to the sub-process, you should consider using the stdin attribute of the Popen object. For example:

import subprocess

p = subprocess.Popen("gzip -c > zipped.gz", shell=True, stdin=subprocess.PIPE)
p.stdin.write("Some data")

# Write more data here...

p.communicate() # Finish writing data and wait for subprocess to finish
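
For the 200 GB case in the question, a minimal sketch of streaming the data in chunks might look like the following, where read_chunks() is a hypothetical generator standing in for whatever produces your uncompressed output:

import subprocess

p = subprocess.Popen("gzip -c > zipped.gz", shell=True, stdin=subprocess.PIPE)
for chunk in read_chunks():  # hypothetical source of the uncompressed data
    p.stdin.write(chunk)
p.communicate()  # close stdin and wait for gzip to exit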

You may also find the discussion at this question helpful.

srgerg
  • I verified that this method is 33% faster on a 1GB highly compressible file. That's a nice improvement compared to gzip.open. Here's the code I used to test it: import subprocess; text = "fjlaskfjioewru oijf alksfjlkqs jr jweqoirjwoiefjlkadsfj afjf\n"; for i in xrange(1,25): text += text; p = subprocess.Popen("gzip -c > zipped.gz", shell=True, stdin=subprocess.PIPE); p.stdin.write(text); p.communicate(). Time for gzip.open: 12.109u 1.194s 0:13.37 99.4% 0+0k 0+0io 0pf+0w; time for the above code: 8.379u 2.602s 0:10.17 107.8% 0+0k 0+0io 0pf+0w – bu11d0zer Nov 28 '11 at 23:45
  • Be sure to accept your favorite answer :-). We all like the extra rep. – Dave Nov 28 '11 at 23:58
  • For the curious, the run time of a large test dropped from 6h43m to 4h31m when I used this method as opposed to gzip.open. This was apples-to-apples on the same machine. This is about 33% faster, which is exactly what I saw in the smaller test case. Thanks everyone! – bu11d0zer Nov 29 '11 at 08:14
  • @bu11d0zer: you should use pastebin for that kind of thing: http://pastebin.com/2kZHsbFH – bukzor Nov 29 '11 at 19:54

Try something like this:

from subprocess import Popen, PIPE
f = open('zipped.gz', 'w')
pipe = Popen('gzip', stdin=PIPE, stdout=f)
pipe.communicate('Hello world\n')
f.close()
Moishe Lettvin

Using the gzip module is the official one-way-to-do-it and it's unlikely that any other pure Python approach will go faster. This is especially true because the size of your data rules out in-memory options. Most likely, the fastest way is to write the full file to disk and use subprocess to call gzip on that file.
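
A minimal sketch of that approach, with an illustrative filename:

import subprocess

# Write the full uncompressed output to disk first...
with open("output.txt", "w") as f:
    f.write("Hello world\n")

# ...then have the external gzip binary compress it in place,
# producing output.txt.gz and removing the original.
subprocess.check_call(["gzip", "output.txt"])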

Raymond Hettinger

Make sure you use the same compression level when comparing speeds. By default, Linux gzip uses level 6, while Python uses level 9. I tested this in Python 3.6.8 with gzip version 1.5, compressing 600 MB of data from a MySQL dump. With default settings:

• python module: 9.24 seconds, 47.1 MB file
• subprocess gzip: 8.61 seconds, 48.5 MB file

After changing the Python module to level 6 so they match:

• python module: 8.09 seconds, 48.6 MB file
• subprocess gzip: 8.55 seconds, 48.5 MB file

Here is the test code:

import gzip
import subprocess
import time

# 'dump' holds the uncompressed MySQL dump as bytes;
# 'outfile' and 'outfile2' are the target .gz paths.

# subprocess method
start = time.time()
with open(outfile, 'wb') as f:
    subprocess.run(['gzip'], input=dump, stdout=f, check=True)
print('subprocess finished after {:.2f} seconds'.format(time.time() - start))

# gzip method
start = time.time()
with gzip.open(outfile2, 'wb', compresslevel=6) as z:
    z.write(dump)
print('gzip module finished after {:.2f} seconds'.format(time.time() - start))
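
Alternatively, you could match the levels from the other side by raising the subprocess to level 9 (a sketch reusing dump and outfile from above):

with open(outfile, 'wb') as f:
    subprocess.run(['gzip', '-9'], input=dump, stdout=f, check=True)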
Elliott B

In addition to @srgerg's answer, I want to apply the same approach with the shell option disabled (shell=False), as is done in @Moishe Lettvin's answer and as recommended in https://stackoverflow.com/a/3172488/2402577:

import subprocess

def zip():
    f = open("zipped.gz", "w")
    p1 = subprocess.Popen(["echo", "Hello World"], stdout=subprocess.PIPE)
    p2 = subprocess.Popen(["gzip", "-9c"], stdin=p1.stdout, stdout=f)
    p1.stdout.close()  # let p1 receive SIGPIPE if gzip exits first
    p2.communicate()
    f.close()

Note that originally I was using p1's output for git diff, as in:

p1 = subprocess.Popen(["git", "diff"], stdout=subprocess.PIPE)
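
Assembled into a full sketch (diff.gz is a hypothetical output name, not from the original):

import subprocess

f = open("diff.gz", "w")
p1 = subprocess.Popen(["git", "diff"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["gzip", "-9c"], stdin=p1.stdout, stdout=f)
p1.stdout.close()  # let p1 receive SIGPIPE if gzip exits first
p2.communicate()
f.close()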

alper