
I have a dataset in S3 that consists of over 7,000 gzipped files that expand to several terabytes. I am trying to read the data, transform it, and write it back to S3 using Spark on EMR. The problem I keep running into is that the RDD is too big to fit in memory, so the transformation slows to a snail's pace as the RDD has to be cached to disk (it is needed again later to calculate stats). What I would like to do is read 100 or 1,000 files, process them, and then start on the next 1,000. Is there any way to do this built into the Spark framework, or do I need to manually list the files and chunk them?
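Roughly, the manual list-and-chunk approach I have in mind looks like the sketch below (bucket, prefix, and batch size are placeholders, and it assumes the Spark 2.x `SparkSession` API):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("batched-s3-transform").getOrCreate()

// List every .gz key under the input prefix (bucket and prefix are placeholders).
val inputRoot = new Path("s3://my-bucket/input/")
val fs = inputRoot.getFileSystem(spark.sparkContext.hadoopConfiguration)
val allFiles = fs.listStatus(inputRoot)
  .filter(_.getPath.getName.endsWith(".gz"))
  .map(_.getPath.toString)
  .toSeq

// Handle roughly 1,000 files per job so no single RDD has to hold the whole dataset.
allFiles.grouped(1000).foreach { batch =>
  val raw = spark.read.text(batch: _*) // gzip input is decompressed transparently
  // ...transform the batch and write its result back to S3 here...
}
```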

Thanks, Nathan

Nath5
  • You could try creating DataFrames and saving them as Parquet files. You can use `SaveMode.Append` and only save the new rows, then just open the `DataFrame` later when you want to calculate your stats (see the sketch after these comments). – David Griffin Apr 04 '16 at 16:46
  • One approach would be to stream the files from your S3 directory using Spark Streaming and then process them one by one. http://stackoverflow.com/questions/30994401/spark-streaming-on-a-s3-directory One more ref: http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles – charles gomes Apr 04 '16 at 16:57
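A rough sketch of the Parquet + `SaveMode.Append` idea from the first comment: append each transformed batch to a Parquet dataset, then reopen the full dataset when computing the stats, so nothing has to stay cached in memory or on local disk. Paths and the `processBatch` helper are placeholders, not from the original comment.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("append-then-stats").getOrCreate()

// Transform one batch of input files and append the result to a Parquet dataset on S3.
def processBatch(paths: Seq[String]): Unit = {
  val transformed = spark.read.text(paths: _*) // replace with the real transformation
  transformed.write
    .mode(SaveMode.Append)
    .parquet("s3://my-bucket/transformed/") // placeholder output path
}

// Once every batch has been appended, reopen the whole Parquet dataset for the stats.
val full = spark.read.parquet("s3://my-bucket/transformed/")
full.describe().show()
```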

0 Answers