
I have made a simple app that encrypts and decrypts files, but when I load a large file (2 GB, for example), my program uses 100% of the memory. I use multiprocessing and multithreading.

from functools import partial
from multiprocessing import Pool, Lock, cpu_count
from multiprocessing.pool import ThreadPool

def encfile(process_pool, lock, file):
    # Read the entire file into memory at once
    with open(file, 'rb') as original_file:
        original = original_file.read()

    # Encrypt in a worker process (encryptfn and key are defined elsewhere)
    encrypted = process_pool.apply(encryptfn, args=(key, original))

    # Overwrite the file with the encrypted data
    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)

poolSize = min(cpu_count(), len(fileList))
process_pool = Pool(poolSize)
thread_pool = ThreadPool(len(fileList))

lock = Lock()
worker = partial(encfile, process_pool, lock)

thread_pool.map(worker, fileList)
manis
  • Instead of reading the whole file into memory and then writing it back, you can read, process, and write in smaller chunks (64 kB perhaps?) by specifying the number of bytes in read and write (see [python file reading and writing](https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects) for an example). In case your encryption algorithm doesn't support chunk-style processing, you could consider using a stream cipher like [ChaCha20](https://pycryptodome.readthedocs.io/en/latest/src/cipher/chacha20.html) – mew Dec 16 '21 at 12:28
  • I am using [fernet](https://cryptography.io/en/latest/fernet/) from the [cryptography](https://cryptography.io/en/latest/) library – manis Dec 16 '21 at 13:14
  • What is your actual question? – Ulrich Eckhardt Dec 16 '21 at 13:57
  • @UlrichEckhardt When I encrypt files, my app consumes 100 percent of the memory – manis Dec 16 '21 at 14:02
  • That's not a question; it's a statement of fact. Either buy more memory or rewrite your code to *not* read the entire file into memory. If you have questions about how to do the latter, rewrite your question to focus on a *specific* problem you are having. – chepner Dec 16 '21 at 14:31
  • Another question has a similar problem, with an [answer about fernet not being a good choice for encrypting large files](https://stackoverflow.com/a/69313565/13229013). You can check out the ChaCha20 stream cipher for encrypting large files in small blocks; [this answer, about encrypting large files in blocks, might be a good starting point](https://stackoverflow.com/a/55220493/13229013) (see the sketch after these comments). – mew Dec 16 '21 at 14:37
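Following up on the ChaCha20 suggestion above, here is a minimal sketch of chunk-wise encryption with a stream cipher, assuming PyCryptodome is installed; the function names, paths, and 64 kB chunk size are illustrative, not from the original post. Because a stream cipher produces ciphertext the same length as the plaintext, the file can be processed in constant memory with no length framing:

from Crypto.Cipher import ChaCha20

BLOCKSIZE = 64 * 1024  # arbitrary chunk size

def encrypt_stream(in_path, out_path, key):
    # Sketch only: key must be 32 bytes; the names here are hypothetical.
    cipher = ChaCha20.new(key=key)   # PyCryptodome picks a random 8-byte nonce
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        dst.write(cipher.nonce)      # store the nonce; it is needed to decrypt
        while chunk := src.read(BLOCKSIZE):
            dst.write(cipher.encrypt(chunk))

def decrypt_stream(in_path, out_path, key):
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        cipher = ChaCha20.new(key=key, nonce=src.read(8))
        while chunk := src.read(BLOCKSIZE):
            dst.write(cipher.decrypt(chunk))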

1 Answer


This is my general idea:

Since memory is a problem, you have to read the files in smaller chunks, say 64K pieces, encrypt each 64K block, and write those out. Of course, the encrypted block will have a length other than 64K, so the problem becomes how to decrypt. So each encrypted block must be prefixed with a fixed-length header that is nothing more than the length of the following encrypted block encoded as a 4-byte unsigned integer (which should be way more than adequate). The decryption loop first reads the next 4-byte length and then knows from that how many bytes long the encrypted block that follows is.

By the way, there is no need to pass a lock to encfile if you are not using it to, for example, count the files processed.

from tempfile import mkstemp
from os import fdopen, replace


BLOCKSIZE = 64 * 1024
ENCRYPTED_HEADER_LENGTH = 4

def encfile(process_pool, lock, file):
    """
    Encrypt file in place.
    """

    fd, path = mkstemp()  # make a temporary file

    with open(file, 'rb') as original_file, \
            fdopen(fd, 'wb') as encrypted_file:
        while True:
            original = original_file.read(BLOCKSIZE)
            if not original:
                break
            encrypted = process_pool.apply(encryptfn, args=(key, original))
            l = len(encrypted)
            # Prefix each encrypted block with its length as a 4-byte
            # big-endian unsigned integer so decryption can find its end:
            l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
            encrypted_file.write(l_bytes)
            encrypted_file.write(encrypted)
    replace(path, file)  # replace the original file with the temporary one


def decfile(file):
    """
    Decrypt file in place.
    """

    fd, path = mkstemp()  # make a temporary file

    with open(file, 'rb') as encrypted_file, \
            fdopen(fd, 'wb') as original_file:
        while True:
            # Read the 4-byte length header; end of file means we are done
            l_bytes = encrypted_file.read(ENCRYPTED_HEADER_LENGTH)
            if not l_bytes:
                break
            l = int.from_bytes(l_bytes, 'big')
            encrypted = encrypted_file.read(l)
            decrypted = decryptfn(key, encrypted)
            original_file.write(decrypted)
    replace(path, file)
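To tie this back to your setup, a hypothetical driver (encryptfn, key, and fileList are assumed to be defined as in your post) might look like:

from functools import partial
from multiprocessing import Pool, Lock, cpu_count
from multiprocessing.pool import ThreadPool

if __name__ == '__main__':
    # Same pool arrangement as the question, now driving the chunked encfile
    poolSize = min(cpu_count(), len(fileList))
    process_pool = Pool(poolSize)
    thread_pool = ThreadPool(len(fileList))

    lock = Lock()
    worker = partial(encfile, process_pool, lock)
    thread_pool.map(worker, fileList)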

Explanation

The larger the block size, the more memory is required (your original program read the entire file; this program will only read 64K at a time). But I am assuming that too small a block size results in too many calls to the encryption, which is done by multiprocessing, and that would require more CPU overhead -- so it's a tradeoff. 64K was arbitrary. Increase it by a lot if you have the memory. You can even try 1024 * 1024 (1M).
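If you want to measure that tradeoff yourself, a rough, illustrative benchmark (using Fernet from the cryptography library, as in the question, on random in-memory data; the sizes chosen are arbitrary) might be:

import os
import time
from cryptography.fernet import Fernet

f = Fernet(Fernet.generate_key())
data = os.urandom(16 * 1024 * 1024)   # 16 MB of random test data

for blocksize in (64 * 1024, 256 * 1024, 1024 * 1024):
    start = time.perf_counter()
    # Encrypt the same data in blocks of the given size
    for i in range(0, len(data), blocksize):
        f.encrypt(data[i:i + blocksize])
    print(f'{blocksize:>8} bytes/block: {time.perf_counter() - start:.3f}s')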

I attempted to explain this before, but let me elaborate:

So let's say that when you encrypt a 64K block, the encrypted size for that one particular 64K block ends up being 67,986 bytes long (a different 64K block will in general encrypt to a different length, unless its unencrypted value happened to be the same). If I just wrote out the data with no other information, I would need some way to know that, to decrypt the file, it is first necessary to read back exactly 67,986 bytes of data and pass that to the decrypt method (with the correct key, of course), because you have to decrypt the precise result of what was encrypted, no fewer and no more bytes. In other words, you can't just read back the encrypted file in arbitrary chunks and pass those chunks to the decrypt method. So the only way to know how big each encrypted chunk is is to prefix those chunks with a header that gives the length of the following chunk.

l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big') takes the integer length stored in variable l and encodes it as a byte array of size ENCRYPTED_HEADER_LENGTH in "big endian" order, meaning that the bytes are arranged from high-order bytes to low-order bytes:

>>> ENCRYPTED_HEADER_LENGTH = 4
>>> l = 67986
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
>>> l_bytes
b'\x00\x01\t\x92'
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'little')
>>> l_bytes
b'\x92\t\x01\x00'
>>>

\t is the tab character, with a value of \x09, so we would be writing out the bytes 00 01 09 92, which is the 4-byte big-endian hexadecimal encoding of 67986.
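Decoding reverses this, which is exactly what decfile does with int.from_bytes:

>>> int.from_bytes(b'\x00\x01\t\x92', 'big')
67986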

Booboo