
I have an S3 bucket in which I store data files that are to be processed by my PySpark code. The folder I want to access is:

s3a://bucket_name/data/

This folder contains subfolders. My aim is to access the contents of the most recently added folder in this directory. I did not want to use boto for certain reasons. Is there any way to list the folders so that I can pick the one I am supposed to access? I can access the files if I specify the folder, but I want to make it dynamic.

Jugraj Singh

1 Answer


I recommend using s3fs, which is a filesystem-style wrapper on boto3. The docs are here: http://s3fs.readthedocs.io/en/latest/

Here's the part you care about (you may have to pass in or otherwise configure your AWS credentials):

import s3fs

# anon=True works only for public buckets; for a private bucket, use
# anon=False so s3fs picks up your configured AWS credentials.
fs = s3fs.S3FileSystem(anon=True)
fs.ls('my-bucket')
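
Since the question asks for the most recently added folder, here is a minimal sketch of how you might build on this, assuming the `bucket_name/data/` prefix from the question and an existing SparkSession named `spark`. S3 "folders" are virtual prefixes with no timestamps of their own, so this sketch also assumes the subfolder names sort chronologically (e.g. they are date-stamped); otherwise you would have to compare the `LastModified` timestamps of the files inside each folder instead.

import s3fs

fs = s3fs.S3FileSystem(anon=False)

# detail=True returns dicts with 'name' and 'type' instead of bare key strings
entries = fs.ls('bucket_name/data/', detail=True)
folders = [e['name'] for e in entries if e['type'] == 'directory']

# Assumption: names like data/2019-06-19/ sort chronologically,
# so the lexicographic max is the most recently added folder.
latest = max(folders)

# Hand the prefix back to Spark as an s3a:// path
# (parquet is just an example; use whatever format your data files are in)
df = spark.read.parquet('s3a://' + latest + '/')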
szeitlin
    Thanks, this worked great. If you're running this on an EMR cluster, add `pip install s3fs` to your bash bootstrapping script, and you'll probably use `s3fs.S3FileSystem(anon=False)` in your code or EMR notebook. – PHY6 Jun 19 '19 at 16:47