I understand that take(n) will return n elements of an RDD, but how Spark decides from which partition to call those elements from and which elements should be chosen? Does it maintain indexes internally on Driver?
- _"Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit."_ – [scaladoc](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD). Basically, every partition has an index, so it knows where to start. – Luis Miguel Mejía Suárez Mar 21 '19 at 21:03
- [Spark count vs take and length](https://stackoverflow.com/q/54744663/10938362) – user10938362 Mar 21 '19 at 22:18
1 Answer
In RDD's `take(n)`, Spark starts by scanning the first partition. If that partition does not contain enough elements, Spark uses what it found to estimate how many additional partitions it needs to scan, and runs further jobs until `n` elements have been collected. Which elements are taken from each scanned partition is determined by the following line in the source:
```scala
val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
```
The `take(n)` method of Scala's `Iterator` is documented as "Selects first *n* values of this iterator" (scaladoc). So the elements are taken from the front of each partition's iterator, with partitions visited in index order.
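The scan-and-grow strategy described above can be sketched in plain Scala, with a `Seq[Seq[T]]` standing in for an RDD's partitions. This is a simplified illustration, not Spark's actual code: the helper names are hypothetical, and the scale-up factor of 4 mirrors Spark's default `spark.rdd.limit.scaleUpFactor`.

```scala
// Hypothetical sketch of RDD.take(n)'s scan-and-grow strategy.
// `partitions` stands in for an RDD's partitions, in index order.
object TakeSketch {
  def take[T](partitions: Seq[Seq[T]], num: Int): Seq[T] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[T]
    var partsScanned = 0
    while (buf.size < num && partsScanned < partitions.size) {
      // First pass scans a single partition; later passes scan more,
      // scaled up by a factor of 4 (Spark's default scaleUpFactor).
      var numPartsToTry = 1
      if (partsScanned > 0) numPartsToTry = partsScanned * 4
      val left = num - buf.size
      val p = partsScanned until
        math.min(partsScanned + numPartsToTry, partitions.size)
      // Analogous to the runJob line: take `left` elements
      // from the front of each scanned partition's iterator.
      val res = p.map(i => partitions(i).iterator.take(left).toSeq)
      res.foreach(r => buf ++= r.take(num - buf.size))
      partsScanned += numPartsToTry
    }
    buf.toSeq
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2), Seq(3, 4, 5), Seq(6, 7))
    // Elements come from the front of each partition, in partition order.
    println(take(parts, 4)) // ArrayBuffer(1, 2, 3, 4)
  }
}
```

Note that the driver never keeps a global index of elements; it only tracks how many partitions have been scanned so far and how many elements are still needed.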

mkhan