
The `repartitionAndSortWithinPartitions` method works well, but I don't want to repartition. I am happy with the way the data is already partitioned.

I only want to sort the contents of each partition; I am not interested in a total sort.

Essentially, I want to avoid reshuffling the data and just get each partition's contents sorted.

hba
  • Possible duplicate of [How to sort within partitions (and avoid sort across the partitions) using RDD API?](https://stackoverflow.com/questions/43339027/how-to-sort-within-partitions-and-avoid-sort-across-the-partitions-using-rdd-a) – ollik1 Jul 25 '19 at 18:17
  • You can improve the title. – thebluephantom Jul 25 '19 at 18:37
  • @ollik1 - thank you, I did go through that question as well. – hba Jul 26 '19 at 19:54

1 Answer


This sorts the data within each partition without repartitioning:

    df.sortWithinPartitions('<sort_column>').show()
Suresh
  • Thank you. Unfortunately, I was not able to take advantage of this approach. I ended up loading the contents of each partition into an ArrayList and sorting the ArrayList. Given my partition size and the container size, I don't think I'll run into an OOM. I will mark this as the correct answer. – hba Jul 26 '19 at 19:52
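
For anyone landing here, the in-memory approach described in the comment above might look roughly like the sketch below. It is written in Python rather than Java, and it is not the commenter's actual code: `sort_partition` and the `partitions` list are hypothetical names, and the list of lists is only a local stand-in for an RDD's partitions so the example runs without a cluster. As in the comment, it assumes each partition fits in memory.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sort_partition(
    rows: Iterable[T], key: Callable[[T], object] = lambda r: r
) -> Iterator[T]:
    """Materialize one partition in memory and return its rows sorted by `key`."""
    # sorted() copies the whole partition into a list, so it must fit in memory.
    return iter(sorted(rows, key=key))


# Hypothetical stand-in for an RDD's partitions: each inner list is one partition.
partitions: List[List[int]] = [[5, 1, 4], [9, 2], [7, 3, 8]]

# Sort each partition independently; no row moves to another partition.
sorted_partitions = [list(sort_partition(p)) for p in partitions]
print(sorted_partitions)  # [[1, 4, 5], [2, 9], [3, 7, 8]]

# With PySpark available, the same function can be passed to RDD.mapPartitions,
# which runs it once per partition without a shuffle:
#   sorted_rdd = rdd.mapPartitions(sort_partition)
```

Because the function only ever sees one partition's iterator, the existing partitioning is left untouched; the trade-off is that the whole partition is held in memory while it is sorted.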