5

I have a DataFrame with a billion records and I want to take 10 records out of it.

Which is the better and faster approach?

df.take(10) or df.limit(10).collect()?

Shaido
Learnis

3 Answers

7

Both methods perform the same, simply because their implementation is the same.

From the Spark implementation on GitHub:

def take(n: Int): Array[T] = head(n)

While the implementation of head is:

def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

As you can see, head is implemented exactly as limit + collect.

Thus they perform the same. The difference you measured must be random variation; try running the experiment many times to average it out.
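To make "run the experiment many times" concrete, here is a minimal timing sketch in plain Scala. The names TimingSketch, timeRuns and median are hypothetical helpers, not part of Spark, and the warm-up and run counts are arbitrary assumptions; treat it as one possible way to compare the two calls rather than a rigorous benchmark.

```scala
// Hypothetical helper (not part of Spark): repeats a block of code and
// reports wall-clock timings so that one-off noise can be averaged out.
object TimingSketch {
  // Runs `body` `warmup` times without measuring (JIT, caches), then
  // `runs` more times, returning each measured duration in milliseconds.
  def timeRuns[A](warmup: Int, runs: Int)(body: => A): Seq[Double] = {
    (1 to warmup).foreach(_ => body)
    (1 to runs).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }
  }

  // The median is less sensitive to a single slow run than the mean.
  def median(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "need at least one measurement")
    val sorted = xs.sorted
    val mid = sorted.length / 2
    if (sorted.length % 2 == 1) sorted(mid)
    else (sorted(mid - 1) + sorted(mid)) / 2.0
  }
}
```

Assuming a DataFrame named df is in scope, you could compare TimingSketch.median(TimingSketch.timeRuns(2, 20)(df.take(10))) with TimingSketch.median(TimingSketch.timeRuns(2, 20)(df.limit(10).collect())). If the two medians are close, the difference seen in a single run was most likely noise.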

antonpuz
  • Their implementations are not the same; one works "by triggering query execution" –  Oct 07 '19 at 11:48
  • @lssilva You should compare take (i.e. head) to limit + collect. limit does return a new Dataset, as it is not an action – antonpuz Oct 07 '19 at 12:16
  • @lssilva a) This issue concerns Python, not Scala. b) This is exactly the fix where take(n) was replaced by limit + collect; examine the pull request – antonpuz Oct 07 '19 at 20:03
  • You are right. It seems they had different implementations in the past, but this change synchronized them: https://github.com/apache/spark/commit/91f4b6f2db12650dfc33a576803ba8aeccf935dd#diff-7a46f10c3cedbf013cf255564d9483cd –  Oct 08 '19 at 06:55
0

Spark uses lazy evaluation, so it doesn't matter which API you use; both will give you the same result with the same performance.

Gaurang Shah
-1

Use take(10); it should be instantaneous.

myDataFrame.take(10) // Action: runs the query and returns an Array of rows
df.limit(10) // Transformation: lazily returns a new Dataset

Reference: spark access first n rows - take vs limit

hagarwal
  • But in my case I can see that df.limit(10).collect() is a bit faster. My assumption is that take(10) on the billion-record DataFrame is a challenging operation, whereas cutting the DataFrame down to 10 records and then collecting is better. – Learnis Oct 07 '19 at 05:33
  • One more point: even if we call take(10), it internally uses limit(10) together with some other function – Learnis Oct 07 '19 at 05:34