How to sort the content of a partition in Spark?

Question

The repartitionAndSortWithinPartitions method works great.

But I don't really want to re-partition. I am happy with the way data is partitioned naturally.

I do want to sort the content of each partition.

I am not interested in a total sort.

Essentially, I want to avoid reshuffling the data. I just need each partition's content sorted.

Possible duplicate of [How to sort within partitions (and avoid sort across the partitions) using RDD API?](https://stackoverflow.com/questions/43339027/how-to-sort-within-partitions-and-avoid-sort-across-the-partitions-using-rdd-a) — ollik1, Jul 25 '19 at 18:17

score 0 · Accepted Answer · answered Jul 25 '19 at 21:34


This sorts the data within each partition, without repartitioning:

df.sortWithinPartitions('<sort_column>').show()
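To check the per-partition ordering, here is a minimal sketch. It assumes an existing SparkSession is passed in as `spark`; the function name `show_sorted_partitions`, the column names, and the sample rows are made up for illustration. The extra `partition` column is only there to show which partition each row sits in.

```python
def show_sorted_partitions(spark):
    """Build a small two-partition DataFrame, sort inside each partition, and print it."""
    # Imported lazily so this snippet can be loaded on a machine without PySpark.
    from pyspark.sql.functions import spark_partition_id

    # Hypothetical sample data, split into 2 partitions up front.
    rows = [(3, "c"), (1, "a"), (2, "b"), (6, "f"), (5, "e"), (4, "d")]
    df = spark.createDataFrame(spark.sparkContext.parallelize(rows, 2), ["id", "label"])

    # Each partition is sorted locally; rows are not moved between partitions.
    sorted_df = df.sortWithinPartitions("id")

    # Tag each row with its partition id to make the per-partition order visible.
    sorted_df.withColumn("partition", spark_partition_id()).show()
    return sorted_df
```

Calling `show_sorted_partitions(spark)` should print the rows grouped by partition id, with `id` ascending inside each group but not across groups.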


Suresh


Thank you. Unfortunately, I was not able to take advantage of this approach. I ended up loading the content of each partition into an ArrayList and sorting the ArrayList. Given my partition size and the container size, I don't think I'll run into an OOM. I will mark this as the correct answer. – hba Jul 26 '19 at 19:52
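For reference, the buffer-and-sort approach from the comment above might look roughly like this with the RDD API in PySpark. This is a sketch, not the commenter's actual code (which used a Java ArrayList); `sort_partition` and `sort_rdd_partitions` are hypothetical helper names. As the comment notes, each partition must fit in memory.

```python
def sort_partition(rows, key=None):
    """Materialize one partition's iterator into a list, sort it, and return an iterator."""
    buffered = list(rows)   # the whole partition is held in memory at once
    buffered.sort(key=key)  # in-place sort; key=None uses the natural ordering
    return iter(buffered)


def sort_rdd_partitions(rdd, key=None):
    """Sort every partition of an RDD independently, leaving the partitioning as it is."""
    # mapPartitions calls the function once per partition, so no shuffle is triggered.
    return rdd.mapPartitions(
        lambda rows: sort_partition(rows, key),
        preservesPartitioning=True,
    )
```

With a SparkContext `sc`, something like `sort_rdd_partitions(sc.parallelize([5, 3, 1, 4, 2, 0], 2)).glom().collect()` returns one list per partition, which makes it easy to confirm that each one is sorted on its own.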
