I'm trying to generate a list of all the S3 files in a bucket/folder. There are usually on the order of millions of files in the folder. I use boto3 right now, and it can retrieve around 33k keys per minute, which, even for a million files, takes about half an hour. I also load these files into a dataframe, but I mainly generate and use this list to track which files are being processed.
What I've noticed is that when I ask Spark to read all the files in the folder, it does a listing of its own, lists them out much faster than the boto call can, and then processes them. I looked for a way to do that listing from PySpark, but found no good examples. The closest I got was some Java and Scala code that lists the files through the Hadoop FileSystem (HDFS) API.
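From those examples, the rough translation I've pieced together goes through PySpark's JVM gateway. This is only an untested sketch of what I think the equivalent would look like; the s3a bucket/prefix values are placeholders, and the sc._jvm / sc._jsc internals are my guess at how to reach the Hadoop classes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

jvm = sc._jvm                                # private API, may not be the intended route
hadoop_conf = sc._jsc.hadoopConfiguration()

HadoopPath = jvm.org.apache.hadoop.fs.Path
URI = jvm.java.net.URI
FileSystem = jvm.org.apache.hadoop.fs.FileSystem

# Placeholder bucket; assumes the s3a connector is configured on the cluster.
fs = FileSystem.get(URI.create("s3a://my-bucket"), hadoop_conf)

# listFiles returns a RemoteIterator<LocatedFileStatus>; True = recursive.
it = fs.listFiles(HadoopPath("s3a://my-bucket/my-prefix/"), True)
files = []
while it.hasNext():
    status = it.next()
    files.append((status.getPath().toString(), status.getLen()))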
Is there a way to do this in Python and Spark? For reference, I'm trying to replicate the following boto3 snippet:
import boto3
from datetime import datetime


def get_s3_files(source_directory, file_type="json"):
    # source_directory is a pathlib Path whose parts encode the bucket and prefix
    s3_resource = boto3.resource("s3")
    file_prepend_path = f"/{'/'.join(source_directory.parts[1:4])}"
    bucket_name = str(source_directory.parts[3])
    prefix = "/".join(source_directory.parts[4:])
    bucket = s3_resource.Bucket(bucket_name)

    # Collect (path, size, source dir, timestamp) for every matching object
    s3_source_files = []
    for obj in bucket.objects.filter(Prefix=prefix):
        if obj.key.endswith(f".{file_type}"):
            s3_source_files.append(
                (
                    f"{file_prepend_path}/{obj.key}",
                    obj.size,
                    str(source_directory),
                    str(datetime.now()),
                )
            )
    return s3_source_files
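For context, this is roughly how I consume the returned list afterwards; the mount-style path and the column names/schema below are just illustrative, not my real values:

from pathlib import Path
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path whose parts encode the bucket and prefix, as the function expects.
source_directory = Path("/mnt/data/my-bucket/some/prefix")

file_list = get_s3_files(source_directory)

# Load the listing into a dataframe so I can track which files get processed.
tracking_df = spark.createDataFrame(
    file_list,
    schema="file_path string, size_bytes long, source_dir string, listed_at string",
)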