We have an S3 bucket containing a large number of files, and the list grows every day. We need a way to list the files and generate counts (a group-by) based on metadata encoded in the file names, for example counting files per date or per source system embedded in the key. We don't need the file contents for this; the files are huge and binary, so downloading them is not practical.
We currently get the list of file names with the S3 Java API, store them in an in-memory list on the driver, and process that list with Spark (a sketch is below). This works for now, while the number of files is in the hundreds of thousands, but it won't scale to meet our future needs.
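For concreteness, here is a minimal sketch of the current approach. The bucket name and the group-by rule (taking the part of the key before the first underscore) are placeholders, not our real values:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class S3KeyCounts {
    public static void main(String[] args) {
        // 1) Collect every key in the bucket into a driver-side list.
        //    This is the step that will not scale as the bucket grows.
        List<String> keys = new ArrayList<>();
        try (S3Client s3 = S3Client.create()) {
            ListObjectsV2Request request =
                    ListObjectsV2Request.builder().bucket("my-bucket").build();
            // The paginator transparently handles the 1000-keys-per-page limit.
            s3.listObjectsV2Paginator(request)
              .contents()
              .forEach(obj -> keys.add(obj.key()));
        }

        // 2) Ship the list to the executors and count keys per group.
        SparkConf conf = new SparkConf().setAppName("s3-key-counts");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            Map<String, Long> counts = sc.parallelize(keys)
                    // Placeholder rule: group by the key prefix before
                    // the first underscore.
                    .map(key -> key.split("_", 2)[0])
                    .countByValue();
            counts.forEach((group, n) -> System.out.println(group + "\t" + n));
        }
    }
}
```

The driver-side `keys` list is exactly the part we expect to become the bottleneck.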
Is there a way to do the entire processing using Spark?