I have a dataset in S3 that consists of over 7,000 gzipped files which expand to several terabytes. I am trying to read the data, transform it, and write it back to S3 using Spark on EMR. The problem I keep running into is that the RDD is too big to fit in memory, so transforming it slows to a snail's pace once the RDD has to be cached to disk (it is needed again later to calculate stats).

What I would like to do is read 100 or 1,000 files, process them, and then start on the next batch. Is there any way to do this built into the Spark framework, or do I need to manually list the files and chunk them myself — something like the sketch below?
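This is roughly what I mean by the manual approach (bucket name, prefix, batch size, and the transform are placeholders for my actual job):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.fs.Path

object ChunkedTransform {
  // Placeholder transform — the real one is more involved.
  def transform(line: String): String = line

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chunked-transform"))

    // List the gzipped input keys up front using Hadoop's FileSystem API
    // (this only lists metadata, it does not read the objects).
    val root = new Path("s3://my-bucket/input/")
    val fs = root.getFileSystem(sc.hadoopConfiguration)
    val files = fs.listStatus(root).map(_.getPath.toString).filter(_.endsWith(".gz"))

    // Process ~1,000 files per pass; sc.textFile accepts a comma-separated list of paths.
    files.grouped(1000).zipWithIndex.foreach { case (batch, i) =>
      val rdd = sc.textFile(batch.mkString(","))
      rdd.map(transform).saveAsTextFile(s"s3://my-bucket/output/batch-$i/")
    }
  }
}
```

This works, but it feels like something the framework might handle for me, and it also means the later stats pass has to re-read the written output.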
Thanks, Nathan