Read large zip file (not the files inside) in python

Question

Until now, I used this code to read zip files:

    import base64

    with open("asset.zip", "rb") as f:
        bytes_of_file = f.read()  # loads the entire file into memory
        encoded = base64.b64encode(bytes_of_file)

It works great, but when I tried it on large zip files (1 GB+), I got a MemoryError. I then tried a solution I found online:

    import base64
    import zipfile

    # ZipFile accepts mode "r", not "rb"
    with zipfile.ZipFile("asset.zip", "r") as z:
        with z.open(...) as f:
            bytes_of_file = f.read()
            encoded = base64.b64encode(bytes_of_file)

But the problem is that zipfile has to open a file inside the archive before I can read anything. I want to read the zip file itself and encode it. How can I do that?

Thanks!

Take a look at this thread: https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas — Elinaldo Monteiro, Aug 12 '20 at 15:22
Where is the base64-encoded zip file going? If the file itself doesn't fit in memory, the base64-encoded version of that same file (which is about 33% bigger) will not fit either. You can write it to a file or a network connection in chunks, but you cannot keep it all in memory at the same time. — Thomas, Aug 12 '20 at 15:27
Hi @Thomas, the code is crashing on the read() method. I didn't think about the next step, but writing to file is a good idea, I just need to read the zip first. — user8446864, Aug 12 '20 at 15:31
Must it be done in Python? On my Linux system, I can simply do `base64 asset.zip > asset.zip.b64` on the command line. — Thomas, Aug 12 '20 at 15:33

score 1 · Answer 1 · answered Aug 12 '20 at 15:39

If the file is too large to fit in memory, you will need to stream it little by little to your output file. Open the input file for reading and the output file for writing (both in binary mode). Then read a chunk of some fixed size from the input file, encode it, and write it to the output.

The trick is to choose the chunk size correctly. Otherwise base64 adds padding (`=` characters) at the end of each encoded chunk, and padding is normally only valid at the very end of a base64-encoded byte string. Base64 encodes every 3 bytes of input (24 bits) as 4 output characters of 6 bits each, with no padding, so the chunk size must be a multiple of 3, for example 3 * 1024 * 1024 bytes = 3 MiB.
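
The approach described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `encode_file_base64`, the `CHUNK_SIZE` constant, and the output file name `asset.zip.b64` are assumptions chosen for the example. It also assumes the input is a regular file on disk, so each `read()` returns a full chunk until the end of the file.

```python
import base64

# 3 MiB; a multiple of 3, so only the final chunk can produce "=" padding.
CHUNK_SIZE = 3 * 1024 * 1024


def encode_file_base64(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Base64-encode src_path into dst_path without loading it all into memory."""
    if chunk_size <= 0 or chunk_size % 3 != 0:
        raise ValueError("chunk_size must be a positive multiple of 3")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            # Each full chunk encodes to exactly 4/3 of its size, no padding.
            dst.write(base64.b64encode(chunk))


if __name__ == "__main__":
    encode_file_base64("asset.zip", "asset.zip.b64")
```

Because every chunk except possibly the last is a multiple of 3 bytes long, the concatenated output should be identical to what `base64.b64encode` would produce for the whole file at once. Memory use stays at roughly one input chunk plus its encoded form.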

Thomas is correct. Check out this old post: https://stackoverflow.com/questions/17220370/memory-error-reading-a-zip-file-in-python — Life is complex, Aug 12 '20 at 17:20
