What is the most efficient way (privileging time over memory) to select the last N (let's say 10) elements of a JavaRDD in Spark? (I'm currently using v1.6.)
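One order-preserving possibility (a minimal sketch, not necessarily the fastest option) is to attach a stable index to every element with `zipWithIndex` and keep only the highest indices. This is written against the Spark 1.6 Java API; the input path and the class name are placeholders:

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class LastN {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("last-n").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile preserves the on-disk line order (hypothetical path).
        JavaRDD<String> lines = sc.textFile("/path/to/input.txt");

        // zipWithIndex assigns indices that follow the partition order,
        // i.e. the original line order. It runs one extra Spark job to
        // compute per-partition offsets.
        JavaPairRDD<String, Long> indexed = lines.zipWithIndex();

        // Keep only the elements whose index falls in the last 10.
        final long cutoff = lines.count() - 10;
        List<String> lastTen = indexed
                .filter(t -> t._2() >= cutoff)
                .map(Tuple2::_1)
                .collect();

        System.out.println(lastTen);
        sc.stop();
    }
}
```

This costs extra passes over the data (one for `zipWithIndex`, one for `count`, one for the final `filter`/`collect`), but no sort and no shuffle.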
- Last according to what criteria? If you sort the data first, this is already an inefficient approach. – zero323 Aug 08 '16 at 15:01
- There's no sorting. According to [this thread](http://stackoverflow.com/questions/29284095/which-operations-preserve-rdd-order), if I read some data into an RDD, the order in which the data were written is preserved (unless I do something that explicitly breaks it). So, let's assume I read a text file with 10,000,000 lines into an RDD, and I want to access/select only the last 10. – McKracken Aug 08 '16 at 15:08
- So... why not process only the N elements of interest in the first place, if you care about speed? – zero323 Aug 08 '16 at 15:25
- Let's say I have to do a lot of processing on the RDD first: I want to find all the lines that contain a specific word, say `running`, and then, among all of them (say they are 10,293 out of the 10,000,000), I want just the last 10. (This is not my real problem, but it is structurally the same. I cannot process only the last N elements, as they might be the last N only after a long processing chain; I just wanted to give an easy example instead of loading everything with the complexity of my actual case.) – McKracken Aug 08 '16 at 15:39
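For a scenario like this one, a single-pass alternative (again just a sketch, not a confirmed answer from the thread) avoids the separate `zipWithIndex` and `count` jobs: keep only the last 10 elements of each partition with `mapPartitions`, collect the small per-partition tails (collect preserves partition order), and take the last 10 locally. The variable `lines` is assumed from the earlier sketch, and note that in Spark 1.6 the `mapPartitions` function returns an `Iterable`:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

// Filtering preserves the relative order of the surviving lines.
// `lines` is the JavaRDD<String> from the earlier sketch.
JavaRDD<String> matches = lines.filter(s -> s.contains("running"));

// Keep at most the last 10 elements of each partition (single pass, no shuffle).
JavaRDD<String> tails = matches.mapPartitions(it -> {
    Deque<String> buf = new ArrayDeque<>(10);
    while (it.hasNext()) {
        if (buf.size() == 10) {
            buf.removeFirst();
        }
        buf.addLast(it.next());
    }
    // Spark 1.6's FlatMapFunction returns an Iterable (an Iterator from 2.0 on).
    return new ArrayList<>(buf);
});

// collect() keeps partition order, so the global last 10 are simply
// the last 10 of the concatenated per-partition tails.
List<String> collected = tails.collect();
List<String> lastTen =
        collected.subList(Math.max(0, collected.size() - 10), collected.size());
```

Only up to 10 elements per partition ever leave the executors, so the driver-side work stays tiny even for a 10,000,000-line input.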