I have some folders in an S3 bucket, each containing files. S3 lists keys in lexicographic (string) order, so the numbered folders come back as 1, 10, 11, 12, 2, 3 instead of 1, 2, 3, 10, 11, 12.
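
To illustrate the ordering, here is a minimal sketch using made-up folder names (no S3 call involved). Sorting the names as strings reproduces the listing order, while sorting on their integer value gives the order wanted:

```python
# Hypothetical folder names, used only to illustrate the two orderings.
folders = ['1', '10', '11', '12', '2', '3']

# Strings compare character by character, so '10' sorts before '2'.
lexicographic = sorted(folders)

# Converting each name to int compares by numeric value instead.
numeric = sorted(folders, key=int)

print(lexicographic)
print(numeric)
```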

I'd like to read the folders in the sequence 1, 2, 3, 10, 11, 12, ... and then read the files inside them.

I have attached a snippet along with the code I'm trying, but it's not working the way I want. As you can see, each folder name contains a number (-0.png-analysis, -1.png-analysis, -10.png-analysis, -11.png-analysis, -2.png-analysis), but the ordering is wrong. Is there a way to read them in 0, 1, 2, 3, 10, 11 order?

for i in bucket.objects.all():
    #print(i.key)
    if i.key.endswith('tables.csv'):
        #s = i.key.split('-')[2]
        print(i.key.split('/')[1])
        #print(sorted(s,key = lambda x: x.split('.')))
        #p = i.key.split('-')[2]
        #print(p)

enter image description here

karan
  • One workaround could be to store all the objects in a `dict` with keys as their sequence (1,2,3,11,10...) and then loop over that dict to get each object. – CaffeinatedCod3r Sep 01 '20 at 09:41
  • Thanks for the response, but could you show that in code? I'm not able to turn your logic into code. @CaffeinatedCod3r – karan Sep 02 '20 at 11:13

1 Answer

As I said, store all the objects in a dict keyed by their sequence number (as an integer), then iterate over that dict in sorted key order.

Here's how it would look:


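import boto3
import collections

BUCKET_NAME = 'bucket'  # placeholder: replace with your bucket name

s3 = boto3.client('s3')
bucket = boto3.resource('s3').Bucket(BUCKET_NAME)

# Map each folder's sequence number (as an int) to the object key.
my_dict = {}
for obj in bucket.objects.all():
    if obj.key.endswith('tables.csv'):
        # 'testfolder/Star Experience-page-0.png-analysis/...' -> 0
        seq = int(obj.key.split('/')[1].split('-')[2].split('.')[0])
        my_dict[seq] = obj.key

print(my_dict)

# Integer keys sort numerically (0, 1, 2, ..., 10, 11), unlike string keys.
od = collections.OrderedDict(sorted(my_dict.items()))

for k, v in od.items():
    csv_obj = s3.get_object(Bucket=BUCKET_NAME, Key=v)
    print(csv_obj['Body'].read().decode('utf-8'))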

NOTE: This assumes no two files share the same sequence number. If they do, the dict keeps only the last key stored for that number, and the earlier files will not be read.
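
If several files can share a sequence number, one option is to skip the dict and sort the list of keys directly with a key function. This is only a sketch: `page_number` is a hypothetical helper, the keys are made-up examples modelled on the key shown in the comments below, and the regex assumes folder names end in `-<number>.png-analysis`.

```python
import re

def page_number(key):
    """Hypothetical helper: integer before '.png-analysis' in the folder name."""
    folder = key.split('/')[1]
    match = re.search(r'-(\d+)\.png-analysis$', folder)
    return int(match.group(1))

# Made-up keys; in practice these would come from bucket.objects.all().
keys = [
    'testfolder/Star Experience-page-10.png-analysis/abc/page-1-tables.csv',
    'testfolder/Star Experience-page-2.png-analysis/abc/page-1-tables.csv',
    'testfolder/Star Experience-page-2.png-analysis/abc/page-2-tables.csv',
    'testfolder/Star Experience-page-0.png-analysis/abc/page-1-tables.csv',
]

# Sort by folder number first, then by the full key to order files within a folder.
ordered = sorted(keys, key=lambda k: (page_number(k), k))

for k in ordered:
    print(k)
```

Because nothing is stored under a unique dict key, both files in the `page-2` folder are kept.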

OrderedDict copied from https://stackoverflow.com/a/9001529/9387017

CaffeinatedCod3r
  • So i followed this logic and getting this error: ParamValidationError: Parameter validation failed: Invalid type for parameter Key, value: s3.ObjectSummary(bucket_name='textractpipelinestack-documentsbucket9ec9deb9-1rm7fo8ds7m69', key='testfolder/Star Experience-page-0.png-analysis/4ce4e750-ede4-11ea-b9c2-1ee1073dc3d0/page-1-tables.csv'), type: , valid types: – karan Sep 03 '20 at 13:02
  • Sorry, my bad. When assigning to the dict it should be `obj.key`. I've fixed the code now, please check it @karan – CaffeinatedCod3r Sep 04 '20 at 04:31
  • Ah, thanks for the correction. The code can read the files now, but they are still not sorted the way I want. They are still being read in the order 0, 1, 10, 11... Do you have any idea? – karan Sep 04 '20 at 10:19
  • If you try to print the output of the dict, are they in sorted order (1,2,3...)? – CaffeinatedCod3r Sep 04 '20 at 10:21
  • No, even when I print 'k' or 'v' or 'my_dict' they are in 0, 1, 10, 11, 12, 2, 3 order... :( – karan Sep 04 '20 at 10:30
  • @CaffeinatedCod3r Yes, it worked, you are a great man!! Thanks a million!!! :) – karan Sep 04 '20 at 11:41
  • Glad to help!! Thank you – CaffeinatedCod3r Sep 04 '20 at 13:30