9

Case: there is a large zip file in an S3 bucket which contains a large number of images. Is there a way, without downloading the whole file, to read the metadata or something else that tells me how many files are inside the zip?

When the file is local, in Python I can just open it with zipfile.ZipFile() and call the namelist() method, which returns a list of all the files inside, so I can count that. However, I'm not sure how to do this when the file resides in S3 without having to download it first. It would also be best if this were possible from Lambda.
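
For reference, this is roughly what I do locally (the path is just an example):

import zipfile

with zipfile.ZipFile("/tmp/images.zip") as zf:
    names = zf.namelist()  # every entry in the archive
    print(len(names), "files inside")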

alfredox
  • Have a look at [the listing logic used here](https://stackoverflow.com/questions/51351000/read-zip-files-from-s3-without-downloading-the-entire-file/52455004#52455004) if you want to do this with the minimal possible bandwidth usage (_might_ be optimizable a bit more, though ;)) – Janaka Bandara Sep 22 '18 at 08:51
  • There is a [GitHub](https://github.com/hkutluay/S3ZipContent/) project for anyone who needs this in a .NET environment. – hkutluay Apr 21 '19 at 09:16

5 Answers

10

I think this will solve your problem:

import zipfile
import io


def fetch(bucket_name, key_name, start, length, client_s3):
    """
    Range-fetches bytes [start, start + length) of an S3 object.
    """
    end = start + length - 1
    s3_object = client_s3.get_object(Bucket=bucket_name, Key=key_name, Range="bytes=%d-%d" % (start, end))
    return s3_object['Body'].read()


def parse_int(raw_bytes):
    """
    Parses 2 or 4 little-endian bytes into their corresponding integer value.
    """
    val = raw_bytes[0] + (raw_bytes[1] << 8)
    if len(raw_bytes) > 3:
        val += (raw_bytes[2] << 16) + (raw_bytes[3] << 24)
    return val


def list_files_in_s3_zipped_object(bucket_name, key_name, client_s3):
    """
    Lists the files in a zipped S3 object without downloading it. Returns the number of files inside the zip file.
    See: https://stackoverflow.com/questions/41789176/how-to-count-files-inside-zip-in-aws-s3-without-downloading-it
    Based on: https://stackoverflow.com/questions/51351000/read-zip-files-from-s3-without-downloading-the-entire-file

    bucket_name: name of the bucket
    key_name:  path to the zipfile inside the bucket
    client_s3: an object created using boto3.client("s3")
    """
    response = client_s3.head_object(Bucket=bucket_name, Key=key_name)
    size = response['ContentLength']

    # the last 22 bytes are the End Of Central Directory (EOCD) record,
    # assuming the archive has no trailing comment
    eocd = fetch(bucket_name, key_name, size - 22, 22, client_s3)

    # start offset and size of the central directory
    cd_start = parse_int(eocd[16:20])
    cd_size = parse_int(eocd[12:16])

    # fetch the central directory, append the EOCD, and open it as a zipfile!
    cd = fetch(bucket_name, key_name, cd_start, cd_size, client_s3)
    zip_file = zipfile.ZipFile(io.BytesIO(cd + eocd))

    print("there are %s files in the zipfile" % len(zip_file.filelist))

    for entry in zip_file.filelist:
        print("filename: %s (%s bytes uncompressed)" % (entry.filename, entry.file_size))
    return len(zip_file.filelist)


if __name__ == "__main__":
    import boto3
    import sys

    client_s3 = boto3.client("s3")
    bucket_name = sys.argv[1]
    key_name = sys.argv[2]
    list_files_in_s3_zipped_object(bucket_name, key_name, client_s3)
parsley72
Daniel777
3

I improved the solution already given - it now also handles files larger than 4 GiB:

import boto3
import io
import struct
import zipfile

s3 = boto3.client('s3')

EOCD_RECORD_SIZE = 22
ZIP64_EOCD_RECORD_SIZE = 56
ZIP64_EOCD_LOCATOR_SIZE = 20

MAX_STANDARD_ZIP_SIZE = 4_294_967_295

def lambda_handler(event, context):
    bucket = event['bucket']
    key = event['key']
    zip_file = get_zip_file(bucket, key)
    print_zip_content(zip_file)

def get_zip_file(bucket, key):
    file_size = get_file_size(bucket, key)
    eocd_record = fetch(bucket, key, file_size - EOCD_RECORD_SIZE, EOCD_RECORD_SIZE)
    if file_size <= MAX_STANDARD_ZIP_SIZE:
        cd_start, cd_size = get_central_directory_metadata_from_eocd(eocd_record)
        central_directory = fetch(bucket, key, cd_start, cd_size)
        return zipfile.ZipFile(io.BytesIO(central_directory + eocd_record))
    else:
        zip64_eocd_record = fetch(bucket, key,
                                  file_size - (EOCD_RECORD_SIZE + ZIP64_EOCD_LOCATOR_SIZE + ZIP64_EOCD_RECORD_SIZE),
                                  ZIP64_EOCD_RECORD_SIZE)
        zip64_eocd_locator = fetch(bucket, key,
                                   file_size - (EOCD_RECORD_SIZE + ZIP64_EOCD_LOCATOR_SIZE),
                                   ZIP64_EOCD_LOCATOR_SIZE)
        cd_start, cd_size = get_central_directory_metadata_from_eocd64(zip64_eocd_record)
        central_directory = fetch(bucket, key, cd_start, cd_size)
        return zipfile.ZipFile(io.BytesIO(central_directory + zip64_eocd_record + zip64_eocd_locator + eocd_record))


def get_file_size(bucket, key):
    head_response = s3.head_object(Bucket=bucket, Key=key)
    return head_response['ContentLength']

def fetch(bucket, key, start, length):
    end = start + length - 1
    response = s3.get_object(Bucket=bucket, Key=key, Range="bytes=%d-%d" % (start, end))
    return response['Body'].read()

def get_central_directory_metadata_from_eocd(eocd):
    cd_size = parse_little_endian_to_int(eocd[12:16])
    cd_start = parse_little_endian_to_int(eocd[16:20])
    return cd_start, cd_size

def get_central_directory_metadata_from_eocd64(eocd64):
    cd_size = parse_little_endian_to_int(eocd64[40:48])
    cd_start = parse_little_endian_to_int(eocd64[48:56])
    return cd_start, cd_size

def parse_little_endian_to_int(little_endian_bytes):
    format_character = "i" if len(little_endian_bytes) == 4 else "q"
    return struct.unpack("<" + format_character, little_endian_bytes)[0]

def print_zip_content(zip_file):
    files = [zi.filename for zi in zip_file.filelist]
    print(f"{len(files)} files: {files}")
kwiecien
  • Can we also retrieve one file from a large ZIP file without downloading it? I'm looking for an answer to this question https://stackoverflow.com/questions/68377520/stream-huge-zip-files-on-s3-using-lambda-and-boto3 – N Raghu Jul 14 '21 at 13:05
  • It should be possible. I didn't need to implement that, but according to the documentation it can be done. Basically you need the EOCD and the CD, and then you can find out where the local headers are. The local headers contain information about the corresponding files' sizes. When you have the offset and size, you can download just a single file by sending a GET with a Range header (a rough sketch of this idea follows these comments). – kwiecien Jul 15 '21 at 21:28
  • I think `parse_little_endian_to_int` should parse to `unsigned`, otherwise we can get negative values for cd_start... – Jan Rüegg Aug 24 '21 at 12:46
  • can you please explain why we are adding `eocd_record` for zip64 format... as in `return zipfile.ZipFile(io.BytesIO(central_directory + zip64_eocd_record + zip64_eocd_locator + eocd_record))`.. Since we already have `zip64_eocd_record`, then why do we need `eocd_record` at the end of the __zip64__ code block `return` statement – Vivek Puurkayastha Mar 01 '22 at 19:08
  • @VivekPuurkayastha take a look at the ZIP specification https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT, section "4.3.6 Overall .ZIP file format": `[central directory header n] [zip64 end of central directory record] [zip64 end of central directory locator] [end of central directory record]`. There is the "standard" EOCD as well as the zip64 EOCD. – kwiecien Mar 02 '22 at 20:21
  • @kwiecien thanks for pointing out the structure ... I was referring to the Wikipedia page on ZIP format and their description of ZIP 64 format could probably have been a bit more categorical... – Vivek Puurkayastha Mar 03 '22 at 10:24
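
A rough, untested sketch of the single-file idea mentioned in the comments above, reusing the fetch helper from this answer and the cd_start value obtained from the EOCD (the bucket, key, and member name are placeholders; encryption, data descriptors, and zip64 local-header extensions are not handled):

import struct
import zlib

def fetch_member(bucket, key, zip_file, cd_start, member_name):
    zi = zip_file.getinfo(member_name)
    # zipfile rebases header_offset against the truncated in-memory stream,
    # so add the central directory start back to get the absolute position.
    header_offset = cd_start + zi.header_offset
    # The local file header has 30 fixed bytes; the file name and extra field
    # lengths are stored at offsets 26 and 28.
    header = fetch(bucket, key, header_offset, 30)
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    data_start = header_offset + 30 + name_len + extra_len
    compressed = fetch(bucket, key, data_start, zi.compress_size)
    if zi.compress_type == zipfile.ZIP_STORED:
        return compressed
    # Deflated members are raw DEFLATE streams (no zlib header), hence -15.
    return zlib.decompress(compressed, -15)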
0

You can try downloading a part of the archive (the first 1 MB, for example) and using the jar tool to see the file list and attributes:

jar vt < first-part-of-archive.zip

And you can use the subprocess module to obtain this data in Python.
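
A rough sketch of that, assuming boto3 and a jar binary on the PATH (the bucket and key are placeholders):

import subprocess
import boto3

s3 = boto3.client("s3")

# Fetch only the first 1 MB of the archive with a ranged GET.
part = s3.get_object(Bucket="my-bucket", Key="big.zip",
                     Range="bytes=0-1048575")["Body"].read()

# Pipe the partial archive into `jar vt` and print whatever listing it can read.
result = subprocess.run(["jar", "vt"], input=part, capture_output=True)
print(result.stdout.decode())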

Stanislav Ivanov
  • I am not familiar with Java, and we have no pieces written in Java for this project. How exactly would I use the subprocess module in Python to obtain the data? I clicked on the link but got a 404 error. – alfredox Jan 22 '17 at 19:26
  • To get a part of a zip archive, if you have a URL you can use the methods described in [this question](http://stackoverflow.com/questions/23602412/only-download-a-part-of-the-document-using-python-requests). The `jar` tool can read the content of an incomplete zip file (the Python zipfile module or the `unzip` tool will not). – Stanislav Ivanov Jan 23 '17 at 10:48
  • This won't work because the central directory is stored at the rear of the file. – vy32 Jul 15 '18 at 18:22
-1

Try the S3 command below to count occurrences of a search string in a gzipped file without downloading it to disk:

aws s3 cp <s3 file uri> - | gunzip -c | grep -i '<Search String>' | wc -l

Example:

aws s3 cp s3://test-bucket/test/test.gz - | gunzip -c | grep -i 'test' | wc -l
Sasikumar Murugesan
-2

As of now, you cannot get such information without downloading the zip file. You can store the required information as metadata for the zip file when uploading it to S3.

As you have mentioned in your question, with the Python zipfile functions we are able to get the file list without extracting. You can use the same approach to get the file count, add it as metadata on the object, and then upload it to S3, as sketched below.
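
A minimal sketch of that idea with boto3 (the bucket, key, and file name are placeholders): count the entries locally, attach the count as object metadata on upload, and later read it back with head_object instead of downloading the zip.

import zipfile
import boto3

s3 = boto3.client("s3")

# Count the entries locally before uploading.
with zipfile.ZipFile("images.zip") as zf:
    file_count = len(zf.namelist())

# Upload the archive with the count stored as user-defined metadata.
s3.upload_file("images.zip", "my-bucket", "archives/images.zip",
               ExtraArgs={"Metadata": {"file-count": str(file_count)}})

# Later, read the count back without downloading the object.
head = s3.head_object(Bucket="my-bucket", Key="archives/images.zip")
print(head["Metadata"]["file-count"])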

Hope this helps, Thanks

Lakhan Kriplani