
I'm investigating using Spark for a project that generates hundreds of GB of data per hour. I'm struggling to make the first step efficient, though it feels like it should be simple!

Suppose there is a daily process that splits the data into two parts, neither of which is small enough to cache in memory. Is there a way to do this in a single Spark job without the source data having to be loaded from HDFS and parsed twice?

Say I want to write all the "Dog" events into a new HDFS location, and all the "Cat" events into another. As soon as I specify the first "write to file" action (for the dogs), Spark will set off loading each file into RAM in turn and filtering for the dog events. Then it has to re-parse every file to find the cats, even though it could have done both at once: it could have loaded each partition of the raw data and written out the cats and dogs for that partition in one go.
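For concreteness, here's roughly what I'm doing at the moment (a minimal sketch; the paths, the `Event` case class and the `parse` function are just placeholders for my real schema):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SplitEvents {
  // Placeholder event type and parser -- the real records are richer.
  case class Event(kind: String, payload: String)
  def parse(line: String): Event = {
    val Array(kind, payload) = line.split("\t", 2)
    Event(kind, payload)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("split-events"))

    // Hundreds of GB of raw text on HDFS, parsed lazily.
    val events = sc.textFile("hdfs:///events/2016-09-24/*").map(parse)

    // First action: a full pass over the raw files.
    events.filter(_.kind == "Dog")
          .map(_.payload)
          .saveAsTextFile("hdfs:///events-split/dogs/2016-09-24")

    // Second action: Spark reloads and re-parses every file again.
    events.filter(_.kind == "Cat")
          .map(_.payload)
          .saveAsTextFile("hdfs:///events-split/cats/2016-09-24")

    sc.stop()
  }
}
```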

If the data fit in memory I could filter down to both types of events, then cache, then do the two writes. But it doesn't, and I can't see how to make Spark do this on a per-partition basis as it goes along. I've studied the API docs thinking I must be missing something...
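In other words, if everything fit in memory I'd just do something like this (continuing with the same `events` RDD from the sketch above), but the persisted data is far too big:

```scala
import org.apache.spark.storage.StorageLevel

// Only viable if the filtered events fit in cluster memory -- they don't.
val interesting = events
  .filter(e => e.kind == "Dog" || e.kind == "Cat")
  .persist(StorageLevel.MEMORY_ONLY)

interesting.filter(_.kind == "Dog").map(_.payload)
  .saveAsTextFile("hdfs:///events-split/dogs/2016-09-24")

interesting.filter(_.kind == "Cat").map(_.payload)
  .saveAsTextFile("hdfs:///events-split/cats/2016-09-24")

interesting.unpersist()
```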

Any advice would be greatly appreciated, thanks!

MatBailey
  • Possible duplicate of [How to split a RDD into two or more RDDs?](http://stackoverflow.com/questions/32970709/how-to-split-a-rdd-into-two-or-more-rdds) –  Sep 24 '16 at 21:06

0 Answers