I have a DataFrame with a billion records and I want to take 10 of them.
Which is the better and faster approach:
df.take(10)
or df.limit(10).collect()?
Both methods result in the same performance, simply because their implementations are the same.
From the Spark implementation on GitHub:
def take(n: Int): Array[T] = head(n)
While the implementation of head is:
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)
As you can see, head is implemented exactly by using limit + collect.
Thus they result in the same performance. Any difference you measured must be random variation; run the experiment many times to average it out.
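If you want to measure this yourself, here is a minimal benchmarking sketch. The spark.range source, the local[*] master, and the run count are all assumptions standing in for your billion-row DataFrame and cluster; adjust to your setup:

import org.apache.spark.sql.SparkSession

object TakeVsLimitBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("take-vs-limit")
      .master("local[*]") // assumption: local run; use your cluster master in practice
      .getOrCreate()

    // Hypothetical stand-in for the billion-row DataFrame in the question.
    val df = spark.range(0L, 100000000L).toDF("id")

    // Time a single invocation of `body` in milliseconds.
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label%-24s ${(System.nanoTime() - start) / 1e6}%10.1f ms")
      result
    }

    // Repeat the measurement to smooth out random variation (JIT warm-up, caching, GC).
    for (run <- 1 to 5) {
      println(s"--- run $run ---")
      time("df.take(10)")(df.take(10))
      time("df.limit(10).collect()")(df.limit(10).collect())
    }

    spark.stop()
  }
}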
Spark does lazy evaluation, so it doesn't matter which API you use; both will give you the same result with the same performance.
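As a quick sanity check, a sketch you can paste into spark-shell (spark.range is a deterministic stand-in for a real DataFrame; with an arbitrary unordered source, the ten rows returned are not guaranteed to match across calls):

val df = spark.range(1000000L).toDF("id") // deterministic stand-in

// Both calls execute the same limit-then-collect plan under the hood,
// so on a deterministic source they return identical rows.
val viaTake  = df.take(10)
val viaLimit = df.limit(10).collect()
assert(viaTake.sameElements(viaLimit))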
Use take(10); it should be near-instantaneous.
myDataFrame.take(10) //Action
df.limit(10) //Transformation
Reference: spark access first n rows - take vs limit
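For completeness, a rough spark-shell sketch of that action/transformation split (the df below is a hypothetical stand-in): limit merely builds a new lazy DataFrame, and no work happens until an action runs.

val df = spark.range(1000000L).toDF("id") // hypothetical stand-in

// Transformation: returns a new (lazy) DataFrame; no Spark job is launched here.
val limited = df.limit(10)

// Action: launches a job and materializes 10 rows on the driver.
val rows = df.take(10)

// The lazy plan above only executes once an action is called on it.
val sameRows = limited.collect()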