I have a parquet file I read from disk (20,000 partitions), and the display command `df.display()` returns almost right away, whereas `df.limit(1).display()` literally takes hours to execute. I don't understand what is going on here. It is not only the `display()` command that is slow, but also a join I would actually like to perform. By contrast, `df.show(n=1)` returns almost instantaneously.


asked by giantsqueed
1 Answer
`limit()` runs per partition first, then combines the per-partition results into a final result. Since there are 20,000 partitions in your data, this takes a long time to execute.
One way to still use `limit()` is to reduce the number of partitions first, as in this answer: `df.coalesce(1).limit(1).display()`. But this is not recommended, as all the data will be moved into a single partition, which may cause an out-of-memory exception.
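The per-partition-then-combine behaviour can be illustrated with a small plain-Python model. This is only a sketch of the pattern described above, not Spark's actual implementation; the function and variable names are made up for illustration:

```python
# Toy model of limit(): every partition is asked for up to n rows,
# and the per-partition results are then combined and trimmed.
# With 20,000 partitions, that means 20,000 tasks before the final trim.

def limit_like(partitions, n):
    gathered = []
    for part in partitions:        # one task per partition
        gathered.extend(part[:n])  # take up to n rows from each partition
    return gathered[:n]            # combine, then trim to n rows total

# 20,000 single-row partitions, mimicking the question's layout
partitions = [[f"row-{i}"] for i in range(20_000)]
print(limit_like(partitions, 1))  # ['row-0'], but only after touching every partition
```

The point of the model: the work grows with the number of partitions, not with the number of rows requested, which is why `limit(1)` on 20,000 partitions is so slow.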

answered by viggnah
-
Any way I could tell pyspark to only consider a single partition? This is really annoying for prototyping the pipeline. – giantsqueed Jul 26 '22 at 16:36
-
@giantsqueed, I am trying to understand here, you already said that `show()` works instantaneously. Any reason you **have** to use `limit()`? If it is to take intermediate data and write for example, can you try `take(1)` or `first()` like in [here](https://stackoverflow.com/questions/46832394/spark-access-first-n-rows-take-vs-limit). Both use `limit()` internally but due to the implementations it could be faster. – viggnah Jul 27 '22 at 05:14
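One reason `show()` and `take()` can return quickly is that they can stop scanning as soon as enough rows have been collected, instead of visiting every partition. A plain-Python sketch of that early-exit behaviour (again just a model with invented names, not Spark internals):

```python
def take_like(partitions, n):
    # Early exit: stop as soon as n rows have been collected,
    # so only the first non-empty partitions are ever scanned.
    out = []
    scanned = 0
    for part in partitions:
        scanned += 1
        for row in part:
            out.append(row)
            if len(out) == n:
                return out, scanned
    return out, scanned

partitions = [[f"row-{i}"] for i in range(20_000)]
rows, scanned = take_like(partitions, 1)
print(rows, scanned)  # ['row-0'] after scanning just 1 of 20,000 partitions
```

Contrast this with the per-partition `limit()` pattern from the answer, which gathers from every partition before combining.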