I have been working on a Python CLI tool for file encryption, for which I decided to use the PyNaCl library. My files are typically 200-500 MB in size. Experimentally, I found that encrypting the data in a single call was slower than splitting it into chunks of around 5 MB and encrypting them with a thread pool. I don't know the nuances of concurrency and parallelism, but I want to know the best and most performant way to encrypt large amounts of data.
Here's what my current implementation looks like:
    from concurrent.futures import ThreadPoolExecutor
    from os import urandom
    from typing import Tuple

    from nacl import secret
    from nacl.bindings import sodium_increment
    from nacl.secret import SecretBox


    def encrypt_chunk(args: Tuple[bytes, SecretBox, bytes, int]):
        chunk, box, nonce, macsize = args
        try:
            outchunk = box.encrypt(chunk, nonce).ciphertext
        except Exception as e:
            err = Exception("Error encrypting chunk")
            err.__cause__ = e
            return err
        if not len(outchunk) == len(chunk) + macsize:
            return Exception("Error encrypting chunk")
        return outchunk


    def encrypt(
        data: bytes,
        key: bytes,
        nonce: bytes,
        chunksize: int,
        macsize: int,
    ):
        box = SecretBox(key)
        args = []
        total = len(data)
        i = 0
        while i < total:
            chunk = data[i : i + chunksize]
            nonce = sodium_increment(nonce)
            args.append((chunk, box, nonce, macsize))
            i += chunksize

        executor = ThreadPoolExecutor(max_workers=4)
        out = executor.map(encrypt_chunk, args)
        executor.shutdown(wait=True)
        return out
I have been wondering whether multiprocessing would be faster than ThreadPoolExecutor(). I don't even know whether my current implementation is a good way to use multithreading, so any advice on that is also appreciated. Thanks.