
I've got a list of keys that I'm retrieving from a cache, and I want to download the associated objects (files) from S3 without having to make a request per key.

Assuming I have the following array of keys:

key_array = [
    '20160901_0750_7c05da39_INCIDENT_MANIFEST.json',
    '20161207_230312_ZX1G222ZS3_INCIDENT_MANIFEST.json',
    '20161211_131407_ZX1G222ZS3_INCIDENT_MANIFEST.json',
    '20161211_145342_ZX1G222ZS3_INCIDENT_MANIFEST.json',
    '20161211_170600_FA68T0303607_INCIDENT_MANIFEST.json'
]

I'm trying to do something similar to this answer on another SO question, but modified like so:

import boto3

s3 = boto3.resource('s3')

incidents = s3.Bucket(my_incident_bucket).objects(key_array)

for incident in incidents:
    # Do fun stuff with the incident body
    incident_body = incident['Body'].read().decode('utf-8')

My ultimate goal is to avoid hitting the AWS API separately for every key in the list. I'd also like to avoid pulling the whole bucket down and filtering/iterating the full results.

afilbert
  • `avoid hitting the AWS API separately for every key` and `avoid having to pull the whole bucket down and filtering/iterating the full results`. How else can you do it? Do your keys follow a pattern? – helloV Jan 06 '17 at 22:00
  • @helloV I was hoping that S3 could accept an array (or otherwise delimited list) of keys that I send in a request, and return matching objects. I've been poring over the documentation for both boto3 and AWS, and haven't found anything, so thought I'd ask here. The keys have a common prefix, but the response from my cache may vary depending on search parameters. – afilbert Jan 06 '17 at 22:03
  • There is no such feature available unless all your keys have the same prefix. – helloV Jan 06 '17 at 22:08
  • So if I have 10K files and they all have a common prefix, I can grab all matching that common prefix. But what if I wanted just 10 files out of that 10K? If I already know the key names, it seems expensive to have to grab all 10K files just to use 10. – afilbert Jan 06 '17 at 22:11
  • No. You get the key names, not the body. Loop through the list of key names and fetch only the ones you are interested in. AWS returns 1000 key names and you have to make another call to get the next batch. – helloV Jan 06 '17 at 22:15
  • I've got a collection that's auto-paging through boto3, so that accounts for the 1000 key name limit. How do you recommend that I fetch the object if it matches any of the keys that I'm after? That's the crux of my question, given that I already have a list of key names that I know I want. – afilbert Jan 06 '17 at 22:20
  • You cannot get the contents of multiple objects in a single request. – Jordon Phillips Jan 06 '17 at 23:08
  • If a day-old query works for you, you may consider S3 storage inventory. http://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html – mootmoot Jan 09 '17 at 09:55
  • Unfortunately, it is not possible to do a bulk download. You have to make n calls for the n JSONs. This increases the final bill, but it is the only solution. – MouIdri Apr 24 '17 at 14:05
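
For reference, the prefix-only server-side filtering discussed in the comments looks roughly like this with the boto3 resource API; the listing is narrowed by prefix, but matching against an arbitrary key list still happens client-side, and each matched object is still a separate GET. The bucket name is hypothetical and key_array is the list from the question:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my_incident_bucket')  # hypothetical bucket name

wanted = set(key_array)

# filter() only narrows the listing server-side by prefix ('2016' matches the sample keys);
# boto3 pages through the listing automatically
for summary in bucket.objects.filter(Prefix='2016'):
    if summary.key in wanted:
        # still one GET request per matched object
        incident_body = summary.get()['Body'].read().decode('utf-8')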

1 Answer


I think the best you are going to get is n API calls, where n is the number of keys in your key_array. The Amazon S3 API doesn't offer much in the way of server-side filtering based on keys, other than prefixes. Here is the code to get it in n API calls:

import boto3
s3 = boto3.client('s3')

for key in key_array:
    incident_body = s3.get_object(Bucket="my_incident_bucket", Key=key)['Body']

    # Do fun stuff with the incident body
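
The n GET calls are unavoidable, but if the key list is long they can at least be issued concurrently from the client side. A minimal sketch using concurrent.futures (the fetch_incident helper and worker count are illustrative; the bucket name and key_array come from the question):

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')

def fetch_incident(key):
    # one GET per key; S3 has no batch GET
    obj = s3.get_object(Bucket="my_incident_bucket", Key=key)
    return key, obj['Body'].read().decode('utf-8')

with ThreadPoolExecutor(max_workers=10) as pool:
    for key, incident_body in pool.map(fetch_incident, key_array):
        # Do fun stuff with the incident body
        pass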
Kevin Seaman
  • Considering my incident list can be arbitrarily numerous, I've decided to move this bit of work to a background process and cache the actual json results. Yay for PostgreSQL native JSON data types. – afilbert Jan 07 '17 at 00:19
  • Are there any new APIs now that would achieve the goal of getting rid of the multiple API calls? – ddttdd Jul 02 '21 at 19:22