I'm a developer fairly new to AWS and S3 who has been tasked with iterating over every object in a large, active S3 bucket and processing the individual object summaries (not the objects themselves).
The bucket continually has new objects written to it and existing objects read, but no objects are ever deleted or updated. We can assume that any new objects arriving in S3 have already been processed, so we only need to worry about the historical data. Re-processing already-processed data is not ideal, but acceptable if there is no effective way around it. New objects are arriving at an unknown rate that could be higher or lower than the rate at which a single thread can issue a ListObjectsV2Request, process the returned object summaries, and retrieve the next page of results.
The bucket holds roughly 100,000,000 (100 million) objects, far too many for a single ListObjectsV2Request (which returns at most 1,000 keys per page, so a full listing is on the order of 100,000 requests), and the full set of object summaries would not fit in memory even if they could all be retrieved in one request.
Is there any way to take a "snapshot" of the current objects within the bucket and perform my processing on that? Failing that, does the pagination supported by the AWS v2 SDK operate on a "snapshot" taken when the first request was made, or will it continually feed in new objects as they are written to the bucket after the first request?
I.e., is it possible to perform
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;

ListObjectsV2Request request = new ListObjectsV2Request()
        .withBucketName(myBucket)
        .withMaxKeys(MAX_KEYS);
ListObjectsV2Result result;
do {
    result = s3Client.listObjectsV2(request);
    processObjectSummaries(result.getObjectSummaries());
    // Carry the continuation token forward to fetch the next page.
    request.setContinuationToken(result.getNextContinuationToken());
} while (result.isTruncated());
and process my historical data with minimal re-processing of new data?
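For context, the closest workaround I can think of is to record a cutoff timestamp before the listing starts and skip any summary whose LastModified is after it; since nothing in the bucket is deleted or updated, anything modified after the cutoff must be a new (already-processed) object. A minimal sketch of that idea, where the cutoff variable and the process(...) helper are my own placeholders rather than anything from the SDK:

import java.time.Instant;
import java.util.List;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// Cutoff captured once, before the first ListObjectsV2 page is requested.
final Instant cutoff = Instant.now();

void processObjectSummaries(List<S3ObjectSummary> summaries) {
    for (S3ObjectSummary summary : summaries) {
        // Skip objects written after the scan started: under the
        // no-delete/no-update assumption they must be new, already-processed.
        if (summary.getLastModified().toInstant().isAfter(cutoff)) {
            continue;
        }
        process(summary); // hypothetical per-summary processing step
    }
}

This would still pay the cost of listing the new keys, but at least keeps them out of the processing step. Is that a reliable way to bound the historical set, or is there a better approach?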