I understand that take(n) will return n elements of an RDD, but how Spark decides from which partition to call those elements from and which elements should be chosen? Does it maintain indexes internally on Driver?
- _"Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit."_ – [scaladoc](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD). Basically, every partition has an index, so it knows where to start. – Luis Miguel Mejía Suárez Mar 21 '19 at 21:03
- [Spark count vs take and length](https://stackoverflow.com/q/54744663/10938362) – user10938362 Mar 21 '19 at 22:18
1 Answer
In RDD's `take(n)`, Spark starts by scanning the first partition. If that partition does not contain enough elements, Spark uses what it found to estimate how many additional partitions it needs to scan, and runs further jobs until `n` elements have been collected. Which elements are taken from each scanned partition is determined by the following line in the source:
```scala
val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
```
The `take(n)` method of Scala's `Iterator` is documented as "Selects first *n* values of this iterator" (scaladoc). So the elements are taken from the front of each partition's iterator, with partitions visited in index order.
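The scan-and-grow strategy described above can be sketched in plain Scala, with a `Seq[Seq[T]]` standing in for an RDD's partitions. This is a simplified illustration, not Spark's actual code: the helper names are hypothetical, and the scale-up factor of 4 mirrors Spark's default `spark.rdd.limit.scaleUpFactor`.

```scala
// Hypothetical sketch of RDD.take(n)'s scan-and-grow strategy.
// `partitions` stands in for an RDD's partitions, in index order.
object TakeSketch {
  def take[T](partitions: Seq[Seq[T]], num: Int): Seq[T] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[T]
    var partsScanned = 0
    while (buf.size < num && partsScanned < partitions.size) {
      // First pass scans a single partition; later passes scan more,
      // scaled up by a factor of 4 (Spark's default scaleUpFactor).
      var numPartsToTry = 1
      if (partsScanned > 0) numPartsToTry = partsScanned * 4
      val left = num - buf.size
      val p = partsScanned until
        math.min(partsScanned + numPartsToTry, partitions.size)
      // Analogous to the runJob line: take `left` elements
      // from the front of each scanned partition's iterator.
      val res = p.map(i => partitions(i).iterator.take(left).toSeq)
      res.foreach(r => buf ++= r.take(num - buf.size))
      partsScanned += numPartsToTry
    }
    buf.toSeq
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2), Seq(3, 4, 5), Seq(6, 7))
    // Elements come from the front of each partition, in partition order.
    println(take(parts, 4)) // ArrayBuffer(1, 2, 3, 4)
  }
}
```

Note that the driver never keeps a global index of elements; it only tracks how many partitions have been scanned so far and how many elements are still needed.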

mkhan