
I have a Parquet dataset I read from disk (20,000 partitions). The display command df.display() returns almost right away, whereas df.limit(1).display() literally takes hours to execute. I don't understand what is going on here. It is also not only the display() command that is slow, but also a join I would actually like to perform. By contrast, df.show(n=1) returns almost instantaneously.

tom10271

1 Answer


limit() applies a local limit on every partition first, then shuffles those partial results into a single task to produce the final result. Since there are 20,000 partitions in your data, that first stage alone has to touch every partition, which is why it takes so long.
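The per-partition behaviour can be sketched with a toy model (plain Python, not Spark — partitions are just lists here, and the function name is made up for illustration):

```python
# Toy model of how a limit(n) plan executes: a local limit runs on
# every partition, then the partial results are merged in one task
# and truncated again to n rows.
def local_then_global_limit(partitions, n):
    # Local limit: each partition is scanned and truncated to n rows.
    # With 20,000 partitions, this stage touches every single one.
    local = [part[:n] for part in partitions]
    # Global limit: gather all partial results into one task, truncate to n.
    merged = [row for part in local for row in part]
    return merged[:n]

# 20,000 partitions of 100 rows each — asking for 1 row still scans them all.
partitions = [[(p, i) for i in range(100)] for p in range(20_000)]
print(local_then_global_limit(partitions, 1))
```

The point of the sketch: the cost is driven by the number of partitions scanned in the first stage, not by the tiny result you asked for.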

One way to still use limit() is to reduce the number of partitions first, as in this answer, with: df.coalesce(1).limit(1).display(). But this is not recommended: coalesce(1) pulls all the data into a single partition on one executor, which can cause an out-of-memory exception.

viggnah
  • Any way I could tell pyspark to only consider a single partition? This is really annoying for prototyping the pipeline. – giantsqueed Jul 26 '22 at 16:36
  • @giantsqueed, I am trying to understand here: you already said that `show()` works instantaneously. Is there any reason you **have** to use `limit()`? If it is to take intermediate data and write it out, for example, can you try `take(1)` or `first()` as in [here](https://stackoverflow.com/questions/46832394/spark-access-first-n-rows-take-vs-limit)? Both use `limit()` internally, but due to their implementation they can be faster. – viggnah Jul 27 '22 at 05:14
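The difference the comment alludes to can be seen in the same toy model: an incremental fetch like `take(n)` can stop after the first non-empty partitions, instead of planning a scan over all of them. This is a plain-Python simulation with a made-up function name, not Spark's actual implementation:

```python
# Toy model of an incremental take(n): read partitions one at a time
# and stop as soon as n rows have been collected.
def toy_take(partitions, n):
    rows, scanned = [], 0
    for part in partitions:
        scanned += 1
        rows.extend(part[: n - len(rows)])
        if len(rows) >= n:
            break  # early exit — remaining partitions are never read
    return rows, scanned

# Same 20,000-partition dataset; only the first partition is read.
partitions = [[(p, i) for i in range(100)] for p in range(20_000)]
rows, scanned = toy_take(partitions, 1)
print(scanned)
```

Spark's real `take` is more sophisticated (it fetches growing batches of partitions per attempt), but the early-exit idea is why it can return in seconds where a full `limit(1)` stage over 20,000 partitions does not.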