Why is the python zipfile module faster than C?

Question

I am writing a module that needs to process a large number of zip files quickly. I therefore planned to use an extractor implemented in C rather than in Python (I'll be calling the extractor from Python). To find out which method is fastest, I wrote a test script comparing Linux's 'unzip' command against the czipfile Python module (a wrapper around a C zip extractor). As a control, I used the native Python zipfile module.

The script creates a zip file of around 100MB out of 100 files of ~1MB each. It looks at 3 scenarios: A) the files are random bytestrings; B) the files are random hex characters; C) the files are randomly constructed sentences with line breaks.
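
To illustrate how different the three kinds of input are for DEFLATE, here is a minimal sketch that builds a small sample of each and reports its compression ratio. It is written for Python 3 (the script below is Python 2), and the helper names make_payload and deflate_ratio, the 64 KiB sample size, and the use of random.choice are assumptions made for the illustration rather than part of the test script.

```python
import binascii
import os
import random
import zlib


def make_payload(kind, size=64 * 1024):
    """Build roughly `size` bytes of test data of kind 'r', 'h' or 's'."""
    if kind == "r":
        # Raw random bytes: essentially incompressible.
        return os.urandom(size)
    if kind == "h":
        # Hex encoding doubles the length, so start from half as many bytes.
        return binascii.b2a_hex(os.urandom(size // 2))
    if kind == "s":
        subjects = ["dog", "cat", "rat"]
        foods = ["meat", "wood", "fish"]
        lines = []
        total = 0
        while total < size:
            line = "The {0} eats {1}\n".format(
                random.choice(subjects), random.choice(foods))
            lines.append(line)
            total += len(line)
        return "".join(lines).encode("ascii")
    raise ValueError("unknown kind: %r" % kind)


def deflate_ratio(data):
    """Compressed size divided by original size (lower = more compressible)."""
    return len(zlib.compress(data)) / len(data)


if __name__ == "__main__":
    for kind in ("r", "h", "s"):
        print(kind, round(deflate_ratio(make_payload(kind)), 3))
```

Random bytes should barely compress, hex text should shrink to somewhat more than half its size, and the sentences should compress much further, so each scenario gives the extractor a different amount of decompression work per output byte.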

In all cases, the performance of zipfile (implemented in Python) was on par with or significantly better than the two extractors implemented in C.

Any ideas why this could be happening? The script is attached. Requires czipfile and the 'unzip' command available in the shell.

from datetime import datetime
import zipfile
import czipfile
import os, binascii, random

class ZipTestError(Exception):
    pass



class ZipTest:

    procs = ['zipfile', 'czipfile', 'os']
    type_map = {'r':'Random', 'h':'Random Hex', 's':'Sentences'}

    # three types. t=='r' is random noise files directly out of urandom. t=='h' is urandom noise converted to ascii characters. t=='s' are randomly constructed sentences with line breaks.
    def __init__(self):
        print """Testing Random Byte Files:
"""
        self.test('r')
        self.test('h')
        self.test('s')




    @staticmethod
    def rand_name():
        return binascii.b2a_hex(os.urandom(10))

    def make_file(self, t): 
        f_name = self.rand_name()
        f = open(f_name, 'w')
        if t == 'r':
            f.write(os.urandom(1048576))
        elif t == 'h':
            f.write(binascii.b2a_hex(os.urandom(1048576)))
        elif t == 's':
            for i in range(76260):
                ops = ['dog', 'cat', 'rat']
                ops2 = ['meat', 'wood', 'fish']
                n1 = int(random.random()*10) % 3
                n2 = int(random.random()*10) % 3
                sentence = """The {0} eats {1}
    """.format(ops[n1], ops2[n2])
                f.write(sentence)
        else:
            raise ZipTestError('Invalid Type')
        f.close()
        return f_name

    #create a ~100MB zip file to test extraction on.
    def create_zip_test(self, t):
        self.file_names = []
        self.output_names = []
        for i in range(100):
            self.file_names.append(self.make_file(t))
        self.zip_name = self.rand_name()
        output = zipfile.ZipFile(self.zip_name, 'w', zipfile.ZIP_DEFLATED)
        for f in self.file_names:
            output.write(f)
        output.close()


    def clean_up(self, rem_zip = False):
        for f in self.file_names:
            os.remove(f)
        self.file_names = []
        for f in self.output_names:
            os.remove(f)
        self.output_names = []
        if rem_zip:
            if getattr(self, 'zip_name', False):
                os.remove(self.zip_name)
            self.zip_name = False

    def display_res(self, res, t):
        print """
{0} results:
""".format(self.type_map[t])
        for p in self.procs:
            print"""
{0} = {1} milliseconds""".format(p, str(res[p]))


    def test(self, t):
        self.create_zip_test(t)
        res = self.unzip()
        self.display_res(res, t)
        self.clean_up(rem_zip = True)


    def unzip(self):
        res = dict()
        for p in self.procs:
            self.clean_up()
            res[p] = getattr(self, "unzip_with_{0}".format(p))()
        return res

    def unzip_with_zipfile(self):
        return self.unzip_with_python(zipfile)

    def unzip_with_czipfile(self):
        return self.unzip_with_python(czipfile)

    def unzip_with_python(self, mod):
        f = open(self.zip_name)
        zf = mod.ZipFile(f)
        start = datetime.now()
        op = './'
        for name in zf.namelist():
            zf.extract(name,op)
            self.output_names.append(name)
        end = datetime.now()
        total = end-start
        ms = total.microseconds
        ms += (total.seconds) * 1000000
        return ms /1000

    def unzip_with_os(self):
        f = open(self.zip_name)
        start = datetime.now()
        zf = zipfile.ZipFile(f)
        for name in zf.namelist():
            self.output_names.append(name)   
        os.system("unzip -qq {0}".format(f.name))
        end = datetime.now()
        total = end-start
        ms = total.microseconds
        ms += (total.seconds) * 1000000
        return ms /1000






if __name__ == '__main__':
    ZipTest()

"zipfile (implemented in python)" Only partially. Keep digging. — Ignacio Vazquez-Abrams, Sep 15 '15 at 00:55
It says in the docs that decryption is pure python. Second paragraph: https://docs.python.org/2/library/zipfile.html — joek575, Sep 15 '15 at 00:58

score 1 · Answer 1 · answered Sep 15 '15 at 01:03

As was pointed out above, only the decryption is done in pure Python, not the decompression. zipfile hands decompression to the zlib module, which is implemented in C, so it is using a C implementation just like the other two.
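
One way to see this is to skip zipfile's extraction path entirely and feed the stored bytes of a member straight to zlib. The following is a minimal Python 3 sketch; the helper names build_zip, raw_deflate_payload and inflate are made up for the illustration, and it assumes an unencrypted, DEFLATE-compressed member.

```python
import io
import struct
import zipfile
import zlib


def build_zip(name, data):
    """Return the bytes of an in-memory zip holding one deflated member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)
    return buf.getvalue()


def raw_deflate_payload(zip_bytes, name):
    """Return the compressed bytes stored for `name`, without decompressing."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        info = zf.getinfo(name)
    # The local file header has 30 fixed bytes; the file-name length and
    # extra-field length are 16-bit little-endian values at offsets 26 and 28.
    name_len, extra_len = struct.unpack_from(
        "<HH", zip_bytes, info.header_offset + 26)
    start = info.header_offset + 30 + name_len + extra_len
    return zip_bytes[start:start + info.compress_size]


def inflate(payload):
    """Decompress a raw DEFLATE stream using zlib (the C library binding)."""
    # wbits=-15 selects a raw stream with no zlib or gzip header.
    return zlib.decompressobj(-15).decompress(payload)


if __name__ == "__main__":
    original = b"The dog eats meat\n" * 1000
    archive = build_zip("a.txt", original)
    print(inflate(raw_deflate_payload(archive, "a.txt")) == original)
```

The Python code here only locates the compressed bytes; the actual inflating happens inside zlib.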

joek575


score 1 · Answer 2 · answered May 25 '17 at 15:30

Even if C is generally faster than interpreted languages, assuming the algorithm is the same, different buffering strategies can make a difference. Here's some evidence:

I made a couple of changes to your script. The diff is below.

I moved the start of the stopwatch to just before os.system. This change makes no noticeable difference, since reading the entries from the Central Directory is quick. I therefore also saved the zip files and measured the unzip time outside Python, with the shell's time builtin. The result shows that the overhead of starting a new process does not matter much.

A more interesting change is the addition of libarchive. The results I obtained are like so (milliseconds):

             Random     Hex  Sentences
zipfile         368    1909        604
czipfile        241    1600       2313
os              707    2225        784
shell-measured  797    2272        737
libarchive      248    1513        451
         EXTRACTION METHOD

Note that results vary by some milliseconds every time. The shell measures real, user, and sys time (see What do 'real', 'user' and 'sys' mean in the output of time(1)?). The figures above reflect real time, for consistency with other measurements.
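
The same real/user/sys breakdown can also be collected from Python for a child process. This is a minimal Python 3 sketch; the function name time_child is an assumption for the illustration, and the children_* fields of os.times() are only meaningful on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def time_child(argv):
    """Run argv and return (real, user, sys) in seconds for the child."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # waits for the child to finish
    real = time.perf_counter() - start
    after = os.times()
    # CPU time of waited-for children accumulates in the children_* fields.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return real, user, system


if __name__ == "__main__":
    # A throwaway child process; replace it with the command to measure.
    print(time_child([sys.executable, "-c", "sum(range(10**6))"]))
```

For the measurement discussed here, the command would be something like ["unzip", "-qq", "archive.zip"], where archive.zip stands in for one of the saved test archives.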

Running unzip under strace -c -w gives a better picture of the system calls it issues. The read count is highest for Hex:

             Random     Hex  Sentences
read            805   14597      12816
write          2600    3200       1600
         SYSTEM CALLS ISSUED BY unzip
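
To show how the buffer size alone drives the number of read calls, here is a minimal Python 3 sketch that reads a file to the end with unbuffered os.read and counts the calls. The function name count_reads is an assumption for the illustration, and it says nothing about which buffer sizes unzip actually uses.

```python
import os


def count_reads(path, bufsize):
    """Read `path` to the end with os.read; return (calls, total_bytes)."""
    # O_BINARY only exists on Windows; elsewhere the flag is simply 0.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    calls = 0
    total = 0
    try:
        while True:
            chunk = os.read(fd, bufsize)
            calls += 1  # the final empty read at end of file counts too
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return calls, total
```

Calling count_reads on the same file with a 4 KiB buffer and then with a 1 MiB buffer returns the same byte total but very different call counts, which is the kind of difference strace -c makes visible.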

Now for the diff. It assumes the original script is named ziptest.py and sits in the directory where you run patch < _diff_ (see patch, diff).

--- ziptest.py.orig 2017-05-25 10:36:03.106994889 +0200
+++ ziptest.py  2017-05-25 11:30:42.032598259 +0200
@@ -2,6 +2,7 @@
 import zipfile
 import czipfile
 import os, binascii, random
+import libarchive.public

 class ZipTestError(Exception):
     pass
@@ -10,7 +11,7 @@

 class ZipTest:

-    procs = ['zipfile', 'czipfile', 'os']
+    procs = ['zipfile', 'czipfile', 'os', 'libarchive']
     type_map = {'r':'Random', 'h':'Random Hex', 's':'Sentences'}

     # three types. t=='r' is random noise files directly out of urandom. t=='h' is urandom noise converted to ascii characters. t=='s' are randomly constructed sentences with line breaks.
@@ -119,10 +120,10 @@

     def unzip_with_os(self):
         f = open(self.zip_name)
-        start = datetime.now()
         zf = zipfile.ZipFile(f)
         for name in zf.namelist():
             self.output_names.append(name)   
+        start = datetime.now()
         os.system("unzip -qq {0}".format(f.name))
         end = datetime.now()
         total = end-start
@@ -130,7 +131,15 @@
         ms += (total.seconds) * 1000000
         return ms /1000

-
+    def unzip_with_libarchive(self):
+        start = datetime.now()
+        for entry in libarchive.public.file_pour(self.zip_name):
+            self.output_names.append(str(entry))
+        end = datetime.now()
+        total = end-start
+        ms = total.microseconds
+        ms += (total.seconds) * 1000000
+        return ms /1000

Interesting. Definitely a bit more focused on the os mechanics than the algo itself. But i suppose that is where the majority of the difference normally comes from in these cases. — joek575, Nov 13 '18 at 06:15
Yeah, Info-ZIP produced 16bit versions of unzip, where small buffers were a must. 64bit programs can read whole files in memory at once, relying on virtual memory —a feature introduced much later. — Ale, Nov 15 '18 at 12:05
