I want to access the first 100 rows of a Spark DataFrame and write the result back to a CSV file.

Why is take(100) basically instant, whereas

df.limit(100)
      .repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", true)
      .option("delimiter", ";")
      .csv("myPath")

takes forever. I do not want to obtain the first 100 records per partition but just any 100 records.

Why is take() so much faster than limit()?

Marioanzas
Georg Heiler

5 Answers

Although this is already answered, I want to share what I learned.

myDataFrame.take(10)

-> returns an Array of Rows. This is an action and collects the data to the driver (like collect does).

myDataFrame.limit(10)

-> returns a new DataFrame. This is a transformation: it is evaluated lazily and does not collect the data.

I do not have an explanation for why limit then takes longer, but that may have been answered above. This is just a basic answer to what the difference between take and limit is.
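
As a rough illustration of that distinction, here is a minimal sketch (assuming a DataFrame named myDataFrame, as above):

import org.apache.spark.sql.{DataFrame, Row}

// take(10) is an action: it triggers a job and brings the rows back to the driver.
val rows: Array[Row] = myDataFrame.take(10)

// limit(10) is a transformation: nothing is executed yet, you only get a new, lazy DataFrame.
val limited: DataFrame = myDataFrame.limit(10)

// Only a subsequent action (count, collect, write, ...) actually runs the limited plan.
val n: Long = limited.count()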

pfnuesel
Kaspatoo
  • The difference between action and transformation is correct, but that does not explain why limit should take longer than take (once the plan executes). – Arjen P. De Vries Nov 11 '20 at 08:16

This is because predicate pushdown is currently not supported in Spark, see this very good answer.

Actually, take(n) should take a really long time as well. I just tested it, however, and get the same results as you do - take is almost instantaneous regardless of database size, while limit takes a lot of time.
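
For what it is worth, the difference also shows up in the physical plans. A minimal sketch (assuming a DataFrame df; the plan output in the comments is schematic and varies by Spark version):

// A bare limit as the final operator is planned as CollectLimit, which is also what
// take() runs: partitions are scanned incrementally until 100 rows have reached the driver.
df.limit(100).explain()
// == Physical Plan ==
// CollectLimit 100
// +- ... scan ...

// Once the limit is followed by repartition/write, it is planned as a LocalLimit/GlobalLimit
// pair with a shuffle, so a full job has to run before anything is written out.
df.limit(100).repartition(1).explain()
// == Physical Plan ==
// Exchange ...
// +- GlobalLimit 100
//    +- LocalLimit 100
//       +- ... scan ...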

Thomas
  • Collect only works on Spark DataFrames. When I collect the first 100 rows it is instant and the data resides in memory as a regular list; collect in Spark's sense is then no longer possible. – Georg Heiler Mar 16 '18 at 09:35
  • You are right of course, I forgot take returns a list. I just tested it, and get the same results - I expected both take and limit to be slow. – Thomas Mar 16 '18 at 09:47
  • https://stackoverflow.com/questions/35869884/more-than-one-hour-to-execute-pyspark-sql-dataframe-take4?noredirect=1&lq=1 <- This question however explicitly states that others have problems with `take()` as well - which version of PySpark are you using? – Thomas Mar 16 '18 at 09:48
  • Spark Scala 2.2 – Georg Heiler Mar 16 '18 at 11:32

You can use take(n) to limit the data. The complete code with output is in the screenshot.
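
Since the screenshot does not come through here, a minimal sketch of that idea for the original question (assuming a SparkSession named spark and a source DataFrame df; the output path "myPath" is taken from the question):

import org.apache.spark.sql.{Row, SaveMode}

// take(100) collects at most 100 rows to the driver as an Array[Row].
val first100: Array[Row] = df.take(100)

// Rebuild a small DataFrame from the collected rows, reusing the original schema,
// and write it out as a single CSV file.
val smallDf = spark.createDataFrame(spark.sparkContext.parallelize(first100), df.schema)

smallDf
  .coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .option("delimiter", ";")
  .csv("myPath")

Because the 100 rows are already on the driver, the write only deals with a tiny DataFrame instead of triggering the expensive limit job.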

Shyam Gupta

limit() cannot be satisfied within a single partition; it has to run across partitions, so it takes more time to execute.

Ratheesh

.take() could be the answer, but I used a simple head call like the one below

df.head(3)

.take() did not work for me.

10 Rep