
We have an S3 bucket with a large number of files, and the list of files is growing every day. We need a way to get the list of file names and generate counts (a group-by) based on metadata embedded in each file name. We don't need the file contents for this; the files are huge and binary, so downloading them is not practical.

We are currently getting the list of file names using the S3 Java API, storing them in a list, and processing them with Spark. This works for now, since the number of files is in the hundreds of thousands, but it won't scale to meet our future needs.
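
For reference, here is a minimal sketch of what we do today (the bucket name and the grouping rule are placeholders; it assumes the AWS SDK for Java v1 listing API and that the grouping key is the first path segment of the object key):

```scala
import com.amazonaws.services.s3.AmazonS3Client
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

object S3NameCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-name-counts"))
    val s3 = new AmazonS3Client()   // credentials from the default provider chain
    val bucket = "my-bucket"        // placeholder bucket name

    // Page through the bucket listing on the driver: keys only, no content.
    var names = Vector.empty[String]
    var listing = s3.listObjects(bucket)
    names ++= listing.getObjectSummaries.asScala.map(_.getKey)
    while (listing.isTruncated) {
      listing = s3.listNextBatchOfObjects(listing)
      names ++= listing.getObjectSummaries.asScala.map(_.getKey)
    }

    // Distribute the names and count per group parsed from the key
    // (here: the first path segment -- an assumption about our naming scheme).
    val counts = sc.parallelize(names)
      .map(_.split("/").head)
      .countByValue()

    counts.foreach { case (group, n) => println(s"$group\t$n") }
    sc.stop()
  }
}
```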

Is there a way to do the entire processing using Spark?

R. Puram
It sounds like you are better off storing + indexing the file names in a database. If you are only after the names, I would also suggest this method, without using the Java API: http://stackoverflow.com/questions/3337912/quick-way-to-list-all-files-in-amazon-s3-bucket – GameOfThrows Dec 01 '15 at 15:27

1 Answer


I achieved something similar by modifying FileInputDStream so that rather than loading the contents of the files into the RDD, it simply creates an RDD from the filenames.

This gives a performance boost if you don't actually want to read the data itself into the RDD, or want to pass filenames to an external command as one of your steps.

Simply change filesToRDD(..) so that it makes an RDD of the filenames, rather than loading the data into the RDD.

See: https://github.com/HASTE-project/bin-packing-paper/blob/master/spark/spark-scala-cellprofiler/src/main/scala/FileInputDStream2.scala#L278
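
Very roughly, the change looks like this (a sketch only; the real filesToRDD in Spark's FileInputDStream is private and builds a union of one RDD per file, so the signature below is an approximation, not the exact upstream code):

```scala
// Inside a copy of FileInputDStream (sketch, not the exact Spark source):
private def filesToRDD(files: Seq[String]): RDD[String] = {
  // Instead of calling context.sparkContext.newAPIHadoopFile(...) for each
  // file and unioning the results, just parallelize the file names so the
  // downstream DStream carries paths rather than file contents.
  context.sparkContext.parallelize(files)
}
```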

Ben