
I am currently experimenting with a multi-threaded Python zipfile password cracker. I assumed that each new thread would act like a new connection/instance and get the job done faster, but when I removed the threading and timed the difference, I was shocked to find the threaded version is about 6× slower. I will insert the code here in case that's the issue.

import zipfile
from threading import Thread

def extractFile(zFile, password):
    try:
        zFile.extractall(pwd=password)
        print '[+] Password:', password
    except:
        pass

def main():
    zFile = zipfile.ZipFile('encrypted.zip')
    passFile = open('dictionary.txt', 'r')
    for line in passFile.readlines():
        password = line.strip('\n')
        t = Thread(target=extractFile, args=(zFile, password))
        t.start()

if __name__ == '__main__':
    main()

With threading removed it completes about 6 times faster. The time results are:

Threaded

real    18m46.974s 
user    18m25.936s   
sys     9m6.872s

Non threaded

real    3m32.674s
user    3m6.400s
sys     0m25.664s

Why is this happening? I expected that using a multi-threaded approach would improve performance.

user3366103
  • "threading is supposed to increase performance" is not true as a general rule, and particularly not in CPython (where the Global Interpreter Lock limits what can be done in parallel). – Charles Duffy May 05 '14 at 19:36
  • ...there are certainly cases where threading *can* increase performance, but they depend very much on the details of your workload and your hardware. It's not magic pixie dust guaranteed to increase performance in any language. – Charles Duffy May 05 '14 at 19:36
  • [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) – shx2 May 05 '14 at 19:36
  • _"I thought a new thread is like a new connection/instance"_. Maybe you're thinking of the `multiprocessing` module, which does spawn new processes, and is not affected by the GIL. – Kevin May 05 '14 at 19:38
  • it's pretty straightforward: in python, `threading` can only speed up I/O bound tasks (like fetching a bunch of webpages). It will only slow down your CPU-bound tasks (like crunching zipfiles, or calculating digits of pi) due to involved overhead. – roippi May 05 '14 at 19:40

1 Answer


There are two issues with this approach:

1) You are spawning N threads, where N is the number of lines in dictionary.txt. A typical dictionary file has thousands of lines, so you are spawning thousands of threads in a tight loop. Having that many threads alive at once is a huge resource drain: each thread consumes memory, and your CPU can only actually run a few threads at a time (in Python it can only run one at a time, more on that in #2). There is also a cost to spawning each thread, and spawning that many is going to slow you down.

2) Because of the GIL, in Python only one thread can actually execute Python bytecode at a time. This negates the benefit of a multi-core CPU, which would otherwise let you run multiple threads in parallel. You should instead use the multiprocessing module, specifically the Pool class, to parallelize. It will allow you to take advantage of multiple cores, and using a fixed-size Pool will prevent you from spawning thousands of processes and grinding your system to a halt.
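To answer the comment asking for a multiprocessing version: here is a minimal sketch of that approach. The `try_password`/`crack` helpers, the `chunksize` value, and the broad exception handling are my own illustration rather than code from the question; note that each worker opens its own `ZipFile` handle, since an open file object cannot safely be shared across processes.

```python
import zipfile
from multiprocessing import Pool


def try_password(args):
    """Worker: return the password if it opens the archive, else None."""
    zip_name, password = args
    try:
        # Each worker process opens its own handle to the archive;
        # a ZipFile object cannot be shared between processes.
        with zipfile.ZipFile(zip_name) as zf:
            zf.extractall(pwd=password.encode())
        return password
    except Exception:
        # A wrong password typically raises RuntimeError, but corrupt
        # output can surface as other errors, so we catch broadly here.
        return None


def crack(zip_name, wordlist):
    """Return the first password from wordlist that opens zip_name, or None."""
    jobs = [(zip_name, pw) for pw in wordlist]
    with Pool() as pool:  # defaults to one worker per CPU core
        # chunksize batches candidates per task to cut down on IPC overhead
        for result in pool.imap_unordered(try_password, jobs, chunksize=100):
            if result is not None:
                pool.terminate()  # stop remaining workers once found
                return result
    return None


if __name__ == '__main__':
    with open('dictionary.txt') as f:
        candidates = [line.strip() for line in f]
    found = crack('encrypted.zip', candidates)
    if found:
        print('[+] Password:', found)
```

Batching with `chunksize` matters here: each candidate password is a tiny unit of work, so handing them to the pool one at a time would spend more time on inter-process communication than on decryption.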

dano
  • thank you, care to update the script so it utilizes multiprocessing? – user3366103 May 05 '14 at 19:49
  • The answer is wrong: the GIL is not the problem. Multithreaded inflating (the actual uncompressing) is possible, because it is a pure C function bracketed by `Py_BEGIN_ALLOW_THREADS` and `Py_END_ALLOW_THREADS`. The problem in your case is that you are trying to read **one** file at many random positions at once. This increases your system load and slows down the whole process. – Daniel May 05 '14 at 20:11