1

I just realized (after some empirical tests) that calling the limit function on a Dataset produces a new Dataset with only 1 partition. How come?
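Here is roughly the kind of test I ran (a minimal sketch in spark-shell, assuming `spark` is the usual SparkSession; the sizes are arbitrary):

```scala
// Sketch of the observation; exact partition counts depend on your setup.
val ds = spark.range(0L, 1000000L)
println(ds.rdd.getNumPartitions)       // > 1 in my tests (matches defaultParallelism)

val limited = ds.limit(100000)
println(limited.rdd.getNumPartitions)  // 1
```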

I can't find any related questions, and reading the source code of LocalLimit and GlobalLimit didn't provide any insight, as I'm not familiar with the internals.

This can be problematic as one might want to use something like limit(1000000) for whatever reason.

Florent Moiny
  • Just a few related threads - https://stackoverflow.com/q/51465806/10465355, http://apache-spark-developers-list.1001551.n3.nabble.com/Limit-Query-Performance-Suggestion-td20570.html, https://stackoverflow.com/q/45886205/10465355 - TL;DR: that is the expected behavior. – 10465355 Nov 21 '19 at 18:54

1 Answer

-1

Check the value of spark.sparkContext.defaultParallelism. If it is not explicitly set, Spark uses the number of available cores as the default number of partitions. If you're testing on a single-core machine and creating datasets, it would be set to 1.
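You can inspect it directly in spark-shell (a quick illustration, assuming `spark` is your SparkSession):

```scala
// Print the default parallelism the paragraph above refers to.
println(spark.sparkContext.defaultParallelism)
```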

If you're concerned about the number of partitions, you can set spark.default.parallelism to a higher number and/or invoke repartition() on your dataset.
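As a rough sketch of both suggestions (the app name, config value, and partition count are only illustrative, not recommendations for your workload):

```scala
import org.apache.spark.sql.SparkSession

// Option 1: set the default parallelism explicitly instead of relying on core count.
val spark = SparkSession.builder()
  .appName("parallelism-demo")                // hypothetical app name
  .config("spark.default.parallelism", "16")  // illustrative value
  .getOrCreate()

// Option 2: explicitly repartition the Dataset after the limit.
val limited = spark.range(0L, 10000000L).limit(1000000)
val redistributed = limited.repartition(16)   // shuffles the data into 16 partitions

println(redistributed.rdd.getNumPartitions)   // 16
```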

stereosky
  • The value was not explicitly set but during my tests there were multiple cores on each slave, so it must be something else. – Florent Moiny Nov 21 '19 at 17:22
  • I can't really think of anything else, other than if you're using Docker to run your Spark cluster you should check the CPU/resources you've given the Docker engine. To avoid the default partitioning becoming problematic, you should make explicit calls to `repartition()`. That's really the way to solve any parallelism issues. – stereosky Nov 25 '19 at 11:39