Performance comparison of take(10) vs limit(10).collect()

Question

I have a DataFrame with a billion records and I want to take 10 records from it.

Which is the better and faster approach?

df.take(10) or df.limit(10).collect()?

score 7 · Answer 1 · answered Oct 07 '19 at 07:47

Both methods perform the same, simply because their implementation is the same.

From the Spark implementation on GitHub:

def take(n: Int): Array[T] = head(n)

While the implementation of head is:

def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

As you can see, head is implemented exactly by using limit+collect.

They therefore perform the same. The difference you measured is most likely random variation; run the experiment many times to average it out.
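
To make "run the experiment many times" concrete, here is a minimal timing sketch in plain Python. The helper names (time_action, summarize) and the stand-in workload fake_action are hypothetical and only there so the snippet runs without a cluster; nothing in it is a Spark API.

```python
import statistics
import time


def time_action(action, repeats=10, warmup=2):
    """Return per-run wall-clock times (seconds) for a zero-argument callable."""
    for _ in range(warmup):
        action()  # warm-up runs are discarded so one-off start-up costs do not skew results
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report the median alongside min/max; the median is less sensitive to outliers."""
    return {
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }


def fake_action():
    # Hypothetical stand-in workload; replace with a real action in a Spark session.
    return sum(range(10_000))


timings = time_action(fake_action)
summary = summarize(timings)
```

In a real PySpark session you would presumably pass `lambda: df.take(10)` and `lambda: df.limit(10).collect()` as the action and compare the two medians; if the min/max ranges overlap heavily, the measured gap is probably noise.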

antonpuz

Their implementations are not the same; one works "by triggering query execution" – Oct 07 '19 at 11:48
@lssilva You should compare take (i.e. head) to limit + collect. On its own, limit returns a new Dataset because it is a transformation, not an action. – antonpuz Oct 07 '19 at 12:16
@lssilva a) This issue concerns Python, not Scala. b) It is exactly the fix in which take(n) was replaced by limit + collect; see the pull request. – antonpuz Oct 07 '19 at 20:03
You are right. It seems that in the past they had different implementations, but this change synchronized them: https://github.com/apache/spark/commit/91f4b6f2db12650dfc33a576803ba8aeccf935dd#diff-7a46f10c3cedbf013cf255564d9483cd – Oct 08 '19 at 06:55

score 0 · Answer 2 · answered Oct 07 '19 at 04:05

Spark uses lazy evaluation, so it doesn't matter which API you use; both will give you the same result with the same performance.

Gaurang Shah

The physical query plan chosen has nothing to do with lazy evaluation or not. – Arjen P. De Vries Nov 11 '20 at 12:54

score -1 · Answer 3 · answered Oct 07 '19 at 04:10

Use take(10); it should be instantaneous.

myDataFrame.take(10) // Action
df.limit(10) // Transformation
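
The action/transformation distinction can be illustrated with a toy model in plain Python. ToyDataset is a hypothetical class written for this sketch, not Spark's code: limit only records the row cap and returns a new object, collect does the reading, and take is modeled as limit followed by collect.

```python
import itertools


class ToyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's classes)."""

    def __init__(self, source, n=None):
        self._source = source  # zero-argument callable returning an iterator of rows
        self._n = n            # row cap recorded by limit(); None means no cap

    def limit(self, n):
        # Transformation: returns a new dataset description and reads nothing.
        cap = n if self._n is None else min(n, self._n)
        return ToyDataset(self._source, cap)

    def collect(self):
        # Action: only now is the source iterated.
        rows = self._source()
        if self._n is None:
            return list(rows)
        return list(itertools.islice(rows, self._n))

    def take(self, n):
        # Modeled as limit followed by collect.
        return self.limit(n).collect()


rows_read = []  # records every row the source actually produces


def source():
    for i in range(1_000_000):
        rows_read.append(i)
        yield i


ds = ToyDataset(source)
limited = ds.limit(10)              # transformation only
reads_after_limit = len(rows_read)  # still zero: nothing has been read
via_take = ds.take(10)
via_limit = limited.collect()
```

In this toy, both paths read only 10 rows each and return the same rows. How many partitions real Spark scans depends on its physical plan, which the toy does not model.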

hagarwal

But in my case I can see that df.limit(10).collect() is a bit faster. My assumption is that take(10) on a billion-record DataFrame is expensive, whereas cutting the DataFrame down to 10 records and then collecting is cheaper. – Learnis Oct 07 '19 at 05:33
One more point: even take(10) internally uses Limit(10) together with some other function. – Learnis Oct 07 '19 at 05:34
