
I have created a function enc():

import base64

from cryptography.fernet import Fernet
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def enc():
    password = bytes('asd123','utf-8')
    salt = bytes('asd123','utf-8')
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=10000,
        backend=default_backend())
    key = base64.urlsafe_b64encode(kdf.derive(password))
    f = Fernet(key)

    for file in files:
        with open(file,'rb') as original_file:
            original = original_file.read()

        encrypted = f.encrypt(original)

        with open(file,'wb') as encrypted_file:
            encrypted_file.write(encrypted)

which loops through every file in files and encrypts it, where files is:

files = ['D:/folder/asd.txt',
          'D:/folder/qwe.mp4',
          'D:/folder/qwe.jpg']

I want to use multithreading or multiprocessing to make it faster. Is it possible? I need some help with the code.

I tried multithreading:

import threading

thread = threading.Thread(target=enc)
thread.start()
thread.join()

But it doesn't seem to improve the speed or running time. I need some help implementing multiprocessing. Thanks.

Sanket
  • Have you tried with multiprocessing? https://docs.python.org/3/library/multiprocessing.html – Luka Rahne Nov 07 '21 at 11:36
  • FYI: Using the same key to encrypt more than one "message" (i.e., more than one _file_ in this case) is considered to be an insecure practice. (Read about "[Differential Cryptanalysis](https://en.wikipedia.org/wiki/Differential_cryptanalysis).") Of course, whether or not it actually is a bad idea depends on the _threat model._ Against whom are you trying to protect the information? Are they capable of _doing_ differential cryptanalysis? Would it be worthwhile for them to spend the time or, to hire professionals to do it for them? – Solomon Slow Nov 07 '21 at 13:31
  • @SolomonSlow I know, this is just sample code – Sanket Nov 07 '21 at 14:21
  • Multithreading generally won't give you performance increases for computationally intensive processes (unless the threading model can use multiple cores, which I don't think Python does). You probably need multiprocessing to do that. – RufusVS Nov 07 '21 at 15:38

2 Answers


Threading is not the best candidate for tasks that are CPU intensive unless the task is being performed, for example, by a C-language library routine that releases the Global Interpreter Lock. In any event, you certainly will not get any performance gains with multithreading or multiprocessing unless you actually run multiple tasks in parallel.

Let's say you have N tasks and M processors to process the tasks. If the tasks were pure CPU with no I/O (not exactly your situation), there would be no advantage in starting more than M processes to work on your N tasks, and for this a multiprocessing pool is the ideal situation. When there is a mix of CPU and I/O, it could be advantageous to have a pool size greater than M, possibly even as large as N if there is a lot of I/O and very little CPU. But in that case it would be better to use a combination of a multithreading pool and a multiprocessing pool (of size M), where the multithreading pool does all of the I/O work and the multiprocessing pool does the CPU computations. The following code shows that technique:

import base64

from functools import partial
from multiprocessing import cpu_count
from multiprocessing.pool import Pool, ThreadPool

from cryptography.fernet import Fernet
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def encrypt(key, b):
    f = Fernet(key)
    return f.encrypt(b)

def enc(key, process_pool, file):
    with open(file,'rb') as original_file:
        original = original_file.read()

    encrypted = process_pool.apply(encrypt, args=(key, original,))

    with open(file,'wb') as encrypted_file:
        encrypted_file.write(encrypted)


def main():
    password = bytes('asd123','utf-8')
    salt = bytes('asd123','utf-8')
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=10000,
        backend=default_backend())
    key = base64.urlsafe_b64encode(kdf.derive(password))

    files = ['D:/folder/asd.txt',
              'D:/folder/qwe.mp4',
              'D:/folder/qwe.jpg']

    # Too many threads may be counter productive due to disk contention
    # Should MAX_THREADS be unlimited?
    # For a solid-state drive with no physical arm movement,
    # an extremely large value, e.g. 500, probably would not hurt.
    # For "regular" drives, one needs to experiment
    MAX_THREADS = 500 # Essentially no limit
    # compute the number of processes in our pool:
    # the smallest of MAX_THREADS, the number of cores we have,
    # and the number of files to process:
    pool_size = min(MAX_THREADS, cpu_count(), len(files))
    # create process pool:
    process_pool = Pool(pool_size)
    # create thread pool:
    thread_pool = ThreadPool(len(files))
    worker = partial(enc, key, process_pool)
    thread_pool.map(worker, files)

if __name__ == '__main__':
    main()

Comment

Anyway, the point is this: Let's say you had 30 files and 4 cores instead of 3 files. The solution posted by @anarchy would start 30 processes and compute f 30 times, but could only effectively utilize 4 processors for the parallel computation of f and for doing the encryption. My solution would use 30 threads for doing the I/O but only start 4 processes, thus computing f only 4 times. You save creating 26 processes and 26 useless computations of f.

It might even be better to have fewer than 30 threads unless you have a solid-state drive, since all your threads are contending for the same drive: (1) each file may be located in a totally different place on the drive, and performing concurrent I/O against such files could be counter-productive, and (2) there is some maximum throughput that can be achieved by any particular drive.

So perhaps we should have:


    thread_pool = ThreadPool(min(len(files), MAX_THREADS))

where MAX_THREADS is set to some maximum value suitable for your particular drive.

Update

Now the expensive computation of key is only done once.

The OP's New Problem Running With Tkinter

Actually you have two problems. Not only are multiple windows being opened, but you are probably also getting a pickle error when trying to call the multiprocessing worker function encrypt, because such functions must be defined at global scope and not nested within another function as you have done.
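
As a quick illustration (a minimal sketch of my own, not your program), pickling works for a module-level worker but fails for one nested inside another function:

from multiprocessing import Pool

def top_level_worker(x):      # defined at global scope: picklable
    return x * x

def broken():
    def nested_worker(x):     # defined inside a function: NOT picklable
        return x * x
    with Pool(2) as pool:
        # fails with an error along the lines of "Can't pickle local object"
        return pool.map(nested_worker, [1, 2, 3])

if __name__ == '__main__':
    with Pool(2) as pool:
        print(pool.map(top_level_worker, [1, 2, 3]))   # works: [1, 4, 9]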

On platforms such as Windows that use the spawn method to create new processes, each process in the pool created by your process_pool = Pool(pool_size) statement is initialized by creating a new, empty address space and launching a new Python interpreter that re-reads and re-executes the source program before ultimately calling the worker function. That means every statement at global scope, i.e. import statements, variable assignments, function definitions, etc., is executed for this purpose. However, in the new subprocess the variable __name__ will not be '__main__', so any statements within an if __name__ == '__main__': block at global scope will not be executed. By the way, that is why on Windows any code at global scope that ultimately results in creating new processes must be placed within such a block; failure to do so would result in an infinitely recursive process-creation loop if it otherwise went undetected. But you placed such a check on __name__ within a nested function, where it serves no purpose.

But realizing that all statements at global scope will be executed as part of the initialization of every process in a multiprocessing pool, you should ideally have at global scope only those statements that are required for the initialization of those processes, or at least "harmless" statements, i.e. statements that are not overly costly to execute and have no unpleasant side effects. "Harmful" statements should be placed within an if __name__ == '__main__': block or moved inside a function.
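
To make the spawn behavior concrete, here is a minimal, self-contained sketch (my own example, not your program) of what gets re-executed in every pool worker and what the if __name__ == '__main__': guard protects:

from multiprocessing import Pool, current_process

# Global scope: re-executed by the parent AND by every spawned pool worker.
print(f'initializing module in {current_process().name}')

def square(x):
    return x * x

if __name__ == '__main__':
    # Guarded: runs only in the parent process, so the pool is not created recursively.
    with Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))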

It should be clear now that the statements you have that create the main window are "harmful" statements that you do not want executed by each newly created process. The tail end of your code should be as follows. (I have also incorporated a MAX_THREADS constant to limit the maximum number of threads that will be created, although here it is set arbitrarily large; you should experiment with much smaller values such as 3, 5, 10, 20, etc. to see what gives you the best throughput.)

def passerrorbox():
    tk.messagebox.showerror('Password Error','Enter a Password')
    fipasswordbox.delete(0,'end')
    fisaltbox.delete(0,'end')
    filistbox.delete(0,'end')

# Changes start here:

# Get rid of all nesting of functions:
def encrypt(key, a):
    f = Fernet(key)
    return f.encrypt(a)

def enc(key, process_pool, file):
    # File Encryption
    with open(file,'rb') as original_file:
        original = original_file.read()

    encrypted = process_pool.apply(encrypt, args=(key, original,))

    with open(file,'wb') as encrypted_file:
        encrypted_file.write(encrypted)

def encfile(): # was previously named main
    password = bytes(fipasswordbox.get(), 'utf-8')
    salt = bytes(fisaltbox.get(),'utf-8')
    fileln = filistbox.get(0,'end')

    if len(fileln) == 0:
        fierrorbox()
    elif len(password) == 0:
        passerrorbox()
    else:
        file_enc_button['state']='disabled'
        browsefi['state']='disabled'

        fipasswordbox['state']='disabled'
        fisaltbox['state']='disabled'

        kdf = PBKDF2HMAC(
            algorithm=hashes.SHA256(),
            length=32,
            salt=salt,
            iterations=10000,
            backend=default_backend())
        key = base64.urlsafe_b64encode(kdf.derive(password))

        # Too many threads may be counter productive due to disk contention
        # Should MAX_THREADS be unlimited?
        # For a solid-state drive with no physical arm movement,
        # an extremely large value, e.g. 500, probably would not hurt.
        # For "regular" drives, one needs to experiment
        MAX_THREADS = 500 # Essentially no limit
        pool_size = min(MAX_THREADS, cpu_count(), len(fileln))
        process_pool = Pool(pool_size)
        thread_pool = ThreadPool(min(MAX_THREADS, len(fileln)))
        worker = partial(enc, key, process_pool)
        thread_pool.map(worker, fileln)

        fiencdone()

if __name__ == '__main__':
    root = tk.Tk()
    fileframe()
    root.mainloop()
Booboo
  • Either `concurrent.futures` or `multiprocessing.pool`, if you omit the *number of processes* (not *pools*) to use, will by default use `multiprocessing.cpu_count()` (or `os.cpu_count()` -- the same thing), i.e. the number of logical CPU cores that you have. This could be overkill if, for example, you have 8 cores but only 3 files in your list. You would be creating 5 processes that would never do any work, and creating processes, especially on Windows, is *expensive*. That is why I use the `min` function. – Booboo Nov 07 '21 at 13:26
  • I am getting an error while running your code - TypeError: cannot pickle '_cffi_backend.FFI' object – Sanket Nov 07 '21 at 14:29
  • That doesn't tell me too much. Post the error message and a stack trace if an exception is being thrown. I don't have your files or all your installed packages/modules so I couldn't really test this. – Booboo Nov 07 '21 at 14:31
  • [error image](https://imgur.com/a/Lkbc3y3) – Sanket Nov 07 '21 at 14:40
  • It seems that variable `f` cannot be passed from one address space to another (from one process to another). So `f` has to be computed by each process in the pool. I have updated the code to do that. Hopefully this will work. – Booboo Nov 07 '21 at 14:55
  • See my comment that I have added to my answer. – Booboo Nov 07 '21 at 15:16
  • I updated the source again to compute `f` at global scope, as it was originally, rather than use a pool initializer function. This makes no real difference for Windows. But on platforms such as Linux that use *fork* to create new processes, `f` will be computed only once regardless of how many processes are in the pool. – Booboo Nov 07 '21 at 15:27
  • It is not the computation of `f` that is expensive but rather the computation of `key`. `key` is only 32 bytes and may safely and efficiently be passed to each process. I'm not completely sure about thread/process safety in python so I'm wary of having a single global instance of `f`. – President James K. Polk Nov 07 '21 at 15:29
  • @PresidentJamesK.Polk `f` cannot be pickled as we found out so each process needs to compute it. This can be done using a pool initializer or as above. On Windows each process in the pool will be re-executing the code that ultimately assigns a value to `f` so each address space will have its own global variable `f`. So thread/process safety is not an issue. – Booboo Nov 07 '21 at 15:57
  • Ok, I think I understand. It might be worth it to compute `key` once before all the multiprocessing work begins because the PBKDF runs for 10000 iterations. – President James K. Polk Nov 07 '21 at 16:01
  • @PresidentJamesK.Polk I took your suggestion. See the updated answer. – Booboo Nov 07 '21 at 16:05
  • ok, thanks @Booboo for the code, but now when I try to run the code using tkinter it spawns new windows. You can check my code here - https://codeshare.io/9O1Pv4 – Sanket Nov 08 '21 at 08:43
  • This should probably have been a new question since it addresses separate issues, but I have updated the answer. – Booboo Nov 08 '21 at 12:11
  • @Booboo yeah it should have been a different question, sorry about that. The code works smoothly; the only problem now is that while the encryption is happening the whole tkinter app lags and shows Not Responding, but it completes the task. – Sanket Nov 08 '21 at 14:59
  • Now it's time to post a new question. – Booboo Nov 08 '21 at 15:50

You need to rework your function.

Python isn’t smart enough to know which part of the code you need multiprocessed.

Most likely it's the for loop, right? You want to encrypt the files in parallel. So you can try something like this.

Define the function that needs to be run for each loop iteration, then create the for loop outside it. Then use multiprocessing like this:

import base64
import multiprocessing

from cryptography.fernet import Fernet
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

password = bytes('asd123','utf-8')
salt = bytes('asd123','utf-8')
kdf = PBKDF2HMAC(
    algorithm=hashes.SHA256(),
    length=32,
    salt=salt,
    iterations=10000,
    backend=default_backend())
key = base64.urlsafe_b64encode(kdf.derive(password))
f = Fernet(key)

def enc(file):
    with open(file,'rb') as original_file:
        original = original_file.read()

    encrypted = f.encrypt(original)

    with open(file,'wb') as encrypted_file:
        encrypted_file.write(encrypted)
    

if __name__ == '__main__':
    jobs = []
    for file in files:  # files is the list of paths from the question
        p = multiprocessing.Process(target=enc, args=(file,))
        jobs.append(p)
        p.start()
anarchy
  • It's way faster now, thanks, but I have a question: for reading, writing to a file, and encrypting the file, is multiprocessing the way to go, or multithreading? – Sanket Nov 07 '21 at 13:01
  • Well I’m not sure about this, all I know is how to code it haha, but this should answer your question. https://stackoverflow.com/questions/18114285/what-are-the-differences-between-the-threading-and-multiprocessing-modules – anarchy Nov 07 '21 at 13:05
  • From what I gather, you need to figure out if your processes share the same memory; if they do, use threads, but if they don't then it's more efficient to use multiprocessing – anarchy Nov 07 '21 at 13:06
  • Ok np, you helped me a lot anyways. – Sanket Nov 07 '21 at 13:10
  • From what I understand though, since you are not using the same memory in each loop, you shouldn’t use threading in this instance. – anarchy Nov 07 '21 at 13:14
  • If you are running this on Windows, then the calculation of `f` will be re-computed needlessly (and never used) by the 3 processes that are created by your main process. See my answer. – Booboo Nov 07 '21 at 13:31
  • The f process is only created once in mine. – anarchy Nov 07 '21 at 13:37
  • It's not a process, it is a calculation. Under Windows or any platform that uses *spawn* to create new processes, *any* code at global scope that is not within a `if __name__ == '__main__':` block is executed as part of the initialization of new processes. That is why process-creation code must be within such a block on Windows. – Booboo Nov 07 '21 at 13:46
  • Ahhh okay I think I understand – anarchy Nov 07 '21 at 13:54
  • Although I just discovered that if you move the calculation of `f` to inside the block so it is calculated just once, it would have to be passed as an argument on your `Process` statement but, unfortunately, `f` cannot be serialized/deserialized (pickled) to allow you to do that. So where it is being calculated (multiple times) is where it needs to stay *in this case*. – Booboo Nov 07 '21 at 14:59