What is the most efficient way (privileging time over memory) to select the last N (let's say 10) elements of a JavaRDD in Spark? (I'm currently using v1.6.)
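One order-preserving possibility (a minimal sketch, not necessarily the fastest option) is to attach a stable index to every element with `zipWithIndex` and keep only the highest indices. This is written against the Spark 1.6 Java API; the input path and the class name are placeholders:

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class LastN {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("last-n").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile preserves the on-disk line order (hypothetical path).
        JavaRDD<String> lines = sc.textFile("/path/to/input.txt");

        // zipWithIndex assigns indices that follow the partition order,
        // i.e. the original line order. It runs one extra Spark job to
        // compute per-partition offsets.
        JavaPairRDD<String, Long> indexed = lines.zipWithIndex();

        // Keep only the elements whose index falls in the last 10.
        final long cutoff = lines.count() - 10;
        List<String> lastTen = indexed
                .filter(t -> t._2() >= cutoff)
                .map(Tuple2::_1)
                .collect();

        System.out.println(lastTen);
        sc.stop();
    }
}
```

This costs extra passes over the data (one for `zipWithIndex`, one for `count`, one for the final `filter`/`collect`), but no sort and no shuffle.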
- Last according to what criteria? If you sort the data first, this is already an inefficient approach. – zero323 Aug 08 '16 at 15:01
- There's no sorting. According to [this thread](http://stackoverflow.com/questions/29284095/which-operations-preserve-rdd-order), if I read some data into an RDD, the order in which the data were written is preserved (unless I do something that explicitly breaks it). So, let's assume I read a text file with 10,000,000 lines into an RDD, and I want to access/select only the last 10. – McKracken Aug 08 '16 at 15:08
- So... why not process only the N elements of interest in the first place, if you care about speed? – zero323 Aug 08 '16 at 15:25
- Let's say I have to do a lot of processing on the RDD first: I want to find all the lines that contain a specific word, say `running`, and then, among all of them (say they are 10,293 out of the 10,000,000), I want just the last 10. (This is not my real problem, but it is structurally the same. I cannot process only the last N elements, as they might be the last N only after a long processing chain; I just wanted to give an easy example instead of loading everything with the complexity of my actual case.) – McKracken Aug 08 '16 at 15:39
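For a scenario like this one, a single-pass alternative (again just a sketch, not a confirmed answer from the thread) avoids the separate `zipWithIndex` and `count` jobs: keep only the last 10 elements of each partition with `mapPartitions`, collect the small per-partition tails (collect preserves partition order), and take the last 10 locally. The variable `lines` is assumed from the earlier sketch, and note that in Spark 1.6 the `mapPartitions` function returns an `Iterable`:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

// Filtering preserves the relative order of the surviving lines.
// `lines` is the JavaRDD<String> from the earlier sketch.
JavaRDD<String> matches = lines.filter(s -> s.contains("running"));

// Keep at most the last 10 elements of each partition (single pass, no shuffle).
JavaRDD<String> tails = matches.mapPartitions(it -> {
    Deque<String> buf = new ArrayDeque<>(10);
    while (it.hasNext()) {
        if (buf.size() == 10) {
            buf.removeFirst();
        }
        buf.addLast(it.next());
    }
    // Spark 1.6's FlatMapFunction returns an Iterable (an Iterator from 2.0 on).
    return new ArrayList<>(buf);
});

// collect() keeps partition order, so the global last 10 are simply
// the last 10 of the concatenated per-partition tails.
List<String> collected = tails.collect();
List<String> lastTen =
        collected.subList(Math.max(0, collected.size() - 10), collected.size());
```

Only up to 10 elements per partition ever leave the executors, so the driver-side work stays tiny even for a 10,000,000-line input.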