I am using Python 2.7.x and the boto 2.x API to connect to an AWS S3 bucket. I have a specific situation where I want to download files from an S3 folder, say myBucket/foo/, but leave the latest file behind in that folder and not download it. Once I have downloaded these files to my local box, I want to move them to a different folder under the same bucket, say myBucket/foo/bar/. Has anyone worked on a similar situation before?

Here is some explanation:

  1. Move the downloaded files from an S3 bucket to a different folder path under the same bucket.

My S3 bucket: event-logs

The folder path on the S3 bucket from which files will be downloaded:

event-logs/apps/raw/source_data/

The folder path on the S3 bucket to which the downloaded files will be moved (archive):

event-logs/apps/raw/archive_data/ 

Note: the "event-logs/apps/raw/" prefix is common to both paths under the same bucket.

So if I have 5 files under the source_data folder on S3:

s3://event-logs/apps/raw/source_data/data1.gz
s3://event-logs/apps/raw/source_data/data2.gz
s3://event-logs/apps/raw/source_data/data3.gz
s3://event-logs/apps/raw/source_data/data4.gz
s3://event-logs/apps/raw/source_data/data5.gz

I need to download the first 4 files (the oldest) to my local machine and leave the latest file, i.e. data5.gz, behind. After the download is complete, move those files from the S3 ../source_data folder to the ../archive_data folder under the same bucket and delete them from the original source_data folder. Here is my code to list the files from S3, to download files, and to delete files.

awsLogShip = AwsLogShip(aws_access_key, aws_secret_access_key, use_ssl=True)
file_names = awsLogShip.getFileNamesInBucket(aws_bucket)

def getFileNamesInBucket(self, aws_bucket):
    if not self._bucketExists(aws_bucket):
        self._printBucketNotFoundMessage(aws_bucket)
        return list()
    else:
        bucket = self._aws_connection.get_bucket(aws_bucket)
        # Return the names of every key under the source_data prefix.
        return map(lambda aws_file_key: aws_file_key.name, bucket.list("apps/raw/source_data/"))
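The listing above returns every key under the prefix, including the newest one. A minimal sketch of the "leave the latest file behind" selection with boto 2, using a hypothetical helper name getOldestFileNames: last_modified on a boto Key is an ISO 8601 string, so sorting it lexicographically also sorts it chronologically.

    def getOldestFileNames(self, aws_bucket, prefix="apps/raw/source_data/"):
        # Sort keys oldest-first; last_modified is an ISO 8601 string,
        # so lexicographic order equals chronological order.
        bucket = self._aws_connection.get_bucket(aws_bucket)
        keys = sorted(bucket.list(prefix), key=lambda k: k.last_modified)
        # Drop the newest key -- it may still be being written.
        return [k.name for k in keys[:-1]]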

awsLogShip.downloadAllFilesFromBucket(aws_bucket, local_download_directory)

def downloadFileFromBucket(self, aws_bucket, filename, local_download_directory):
    if not self._bucketExists(aws_bucket):
        self._printBucketNotFoundMessage(aws_bucket)
    else:
        bucket = self._aws_connection.get_bucket(aws_bucket)
        for s3_file in bucket.list("apps/raw/source_data/"):
            if filename == s3_file.name:
                self._downloadFile(s3_file, local_download_directory)
                break
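_downloadFile is not shown in the question; a minimal sketch of what it might look like, assuming the local file should be named after the key's basename (boto 2's Key.get_contents_to_filename writes a key's contents to a local path):

    import os

    def _downloadFile(self, s3_file, local_download_directory):
        # Save the key's contents under its basename in the local directory.
        local_path = os.path.join(local_download_directory,
                                  os.path.basename(s3_file.name))
        s3_file.get_contents_to_filename(local_path)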

awsLogShip.deleteAllFilesFromBucket(aws_bucket)

def deleteFilesInBucketWith(self, aws_bucket, filename):
    if not self._bucketExists(aws_bucket):
        self._printBucketNotFoundMessage(aws_bucket)
    else:
        bucket = self._aws_connection.get_bucket(aws_bucket)
        # filename is used here as a predicate that matches key names.
        for s3_file in filter(lambda fkey: filename(fkey.name), bucket.list("apps/raw/source_data/")):
            self._deleteFile(bucket, s3_file)
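For the "move to archive" step: S3 has no native move or rename operation, so a move is a server-side copy followed by a delete. A minimal sketch using boto 2's Bucket.copy_key and Bucket.delete_key; _moveFile is a hypothetical helper name:

    def _moveFile(self, bucket, key_name):
        # "Move" = copy into the archive prefix, then delete the original.
        archived_name = key_name.replace("source_data/", "archive_data/")
        bucket.copy_key(archived_name, bucket.name, key_name)
        bucket.delete_key(key_name)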

What I really want to achieve here is:

  1. Select the list of oldest files to download, i.e. always leave the latest-modified file behind and perform no action on it (the idea being that this file may not be ready to download yet, or may still be being written).
  2. Move that same list of downloaded files to a new location under the same bucket, and delete them from the original source_data folder; a sketch putting both steps together follows.
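Putting the pieces together, a minimal end-to-end sketch under the same assumptions (boto 2, the prefixes from the question, and a local_download_directory variable):

    import os
    import boto

    conn = boto.connect_s3(aws_access_key, aws_secret_access_key)
    bucket = conn.get_bucket("event-logs")

    # Oldest-first; skip the newest key, which may still be being written.
    keys = sorted(bucket.list("apps/raw/source_data/"),
                  key=lambda k: k.last_modified)[:-1]

    for key in keys:
        # 1. Download.
        key.get_contents_to_filename(
            os.path.join(local_download_directory, os.path.basename(key.name)))
        # 2. Move = copy to the archive prefix, then delete the original.
        archived = key.name.replace("source_data/", "archive_data/")
        bucket.copy_key(archived, bucket.name, key.name)
        bucket.delete_key(key.name)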
  • Your requirements are difficult to understand. Could you provide an example (eg show contents before & after)? – John Rotenstein Sep 01 '15 at 10:12
  • The answer to this question might help - http://stackoverflow.com/questions/30161700/how-to-move-files-between-two-amazon-s3-buckets-using-boto – David Fevre Sep 01 '15 at 11:58
  • @John Rotenstein - as per your request I have updated the question with more details. Can I please have your expert opinion on this? – Guddi Sep 02 '15 at 19:32
  • Yes, it's all possible but here's a thought... Why do you wish to skip the "latest" file? If it is still "being written to", it probably won't appear in your object listing. Is it "written" by more than one process? Does it get overwritten? If not, it will probably only appear after it has been fully created. This would greatly simplify your requirements down to about 3 lines: Loop through a list, Download, Move. – John Rotenstein Sep 05 '15 at 03:28
  • Here's another thought... What are you intending to do with the files once they have been downloaded -- are you processing them individually, or just storing a copy? If it's just to keep a copy, then I'd recommend using the [AWS Command-Line Interface (CLI)](http://aws.amazon.com/cli/) `aws s3 sync` command to synchronise a local directory with the bucket. That way, you wouldn't even need to move the files to a different directory since there's no "processing" taking place. Can you provide a bigger explanation of your use-case, eg why you need to move the files this way? – John Rotenstein Sep 05 '15 at 11:55
  • @JohnRotenstein - Thank you for your comments. This is how I achieved this: bucket_list = bucket.list(prefix='Download/test_queue1/', delimiter='/') list1 = sorted(bucket_list, key=lambda item1: item1.last_modified) self.list2 = list1[:-1] for item in self.list2: self._bucketList(bucket, item) – Guddi Sep 10 '15 at 04:59

1 Answer

This is how I solved this problem!

    bucket_list = bucket.list(prefix='Download/test_queue1/', delimiter='/')
    # last_modified is an ISO 8601 string, so sorting it lexicographically
    # also sorts chronologically; [:-1] drops the newest key.
    list1 = sorted(bucket_list, key=lambda item1: item1.last_modified)
    self.list2 = list1[:-1]
    for item in self.list2:
        self._bucketList(bucket, item)

    def _bucketList(self, bucket, item):
        print item.name, item.last_modified
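The snippet above selects and prints the oldest keys but stops short of the download-and-archive step. A hypothetical completion of _bucketList under the same assumptions (os imported; a self.local_download_directory attribute and an archive/ target prefix are invented for illustration):

    def _bucketList(self, bucket, item):
        print item.name, item.last_modified
        # Download the key, then archive it (server-side copy + delete).
        item.get_contents_to_filename(
            os.path.join(self.local_download_directory,
                         os.path.basename(item.name)))
        archived = item.name.replace('test_queue1/', 'archive/')
        bucket.copy_key(archived, bucket.name, item.name)
        bucket.delete_key(item.name)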