I want to access the first 100 rows of a Spark DataFrame and write the result back to a CSV file.

Why is take(100) basically instant, whereas

df.limit(100)
      .repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", true)
      .option("delimiter", ";")
      .csv("myPath")

takes forever. I do not want to obtain the first 100 records per partition but just any 100 records.

Why is take() so much faster than limit()?

Marioanzas
Georg Heiler

5 Answers

Although this is already answered, I want to share what I learned.

myDataFrame.take(10)

-> returns an Array of Rows. This is an action and collects the data to the driver (like collect does).

myDataFrame.limit(10)

-> returns a new DataFrame. This is a transformation: it is evaluated lazily and does not collect the data.

I do not have an explanation for why limit then takes longer, but that may have been answered above. This is just a basic answer to what the difference between take and limit is.
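
As a rough illustration of that distinction, here is a minimal sketch (assuming a DataFrame named myDataFrame, as above):

import org.apache.spark.sql.{DataFrame, Row}

// take(10) is an action: it triggers a job and brings the rows back to the driver.
val rows: Array[Row] = myDataFrame.take(10)

// limit(10) is a transformation: nothing is executed yet, you only get a new, lazy DataFrame.
val limited: DataFrame = myDataFrame.limit(10)

// Only a subsequent action (count, collect, write, ...) actually runs the limited plan.
val n: Long = limited.count()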

pfnuesel
Kaspatoo
  • The difference between action and transformation is correct, but that does not explain why limit should take longer than take (once the plan executes). – Arjen P. De Vries Nov 11 '20 at 08:16

This is because predicate pushdown is currently not supported in Spark, see this very good answer.

Actually, take(n) should take a really long time as well. I just tested it, however, and get the same results as you do - take is almost instantaneous regardless of database size, while limit takes a lot of time.
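
For what it is worth, the difference also shows up in the physical plans. A minimal sketch (assuming a DataFrame df; the plan output in the comments is schematic and varies by Spark version):

// A bare limit as the final operator is planned as CollectLimit, which is also what
// take() runs: partitions are scanned incrementally until 100 rows have reached the driver.
df.limit(100).explain()
// == Physical Plan ==
// CollectLimit 100
// +- ... scan ...

// Once the limit is followed by repartition/write, it is planned as a LocalLimit/GlobalLimit
// pair with a shuffle, so a full job has to run before anything is written out.
df.limit(100).repartition(1).explain()
// == Physical Plan ==
// Exchange ...
// +- GlobalLimit 100
//    +- LocalLimit 100
//       +- ... scan ...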

Thomas
  • Collect only works on Spark DataFrames. When I collect the first 100 rows it is instant and the data resides in memory as a regular list; collect in Spark's sense is then no longer possible. – Georg Heiler Mar 16 '18 at 09:35
  • You are right of course, I forgot take returns a list. I just tested it, and get the same results - I expected both take and limit to be slow. – Thomas Mar 16 '18 at 09:47
  • https://stackoverflow.com/questions/35869884/more-than-one-hour-to-execute-pyspark-sql-dataframe-take4?noredirect=1&lq=1 <- This question however explicitly states that others have problems with `take()` as well - which version of PySpark are you using? – Thomas Mar 16 '18 at 09:48
  • Spark Scala 2.2 – Georg Heiler Mar 16 '18 at 11:32

You can use take(n) to limit the data. The complete code with output is in the screenshot.
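
Since the screenshot does not come through here, a minimal sketch of that idea for the original question (assuming a SparkSession named spark and a source DataFrame df; the output path "myPath" is taken from the question):

import org.apache.spark.sql.{Row, SaveMode}

// take(100) collects at most 100 rows to the driver as an Array[Row].
val first100: Array[Row] = df.take(100)

// Rebuild a small DataFrame from the collected rows, reusing the original schema,
// and write it out as a single CSV file.
val smallDf = spark.createDataFrame(spark.sparkContext.parallelize(first100), df.schema)

smallDf
  .coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("delimiter", ";")
  .csv("myPath")

Because the 100 rows are already on the driver, the write only deals with a tiny DataFrame instead of triggering the expensive limit job.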

Shyam Gupta

limit() cannot be satisfied within a single partition; it has to run across partitions, so it takes more time to execute.

Ratheesh

.take() could be the answer, but I used a simple head call like the one below

df.head(3)

.take() did not work for me.

10 Rep