I have worked with both a parallel quicksort algorithm and PSRS (Parallel Sorting by Regular Sampling), which essentially combines parallel quicksort with merging.
With parallel quicksort, I have demonstrated near-linear speedup on up to 4 hardware threads (a dual-core CPU with hyper-threading), which is expected given the limitations of the algorithm. A pure parallel quicksort relies on a shared stack, and contention among threads for that stack eats into the performance gain as more threads are added. The advantage of this algorithm is that it sorts in place, which reduces the amount of memory needed. You may want to consider this when sorting upwards of 100M elements, as you stated.
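To give a rough idea of the in-place approach, here is a minimal sketch. Note that it is not the shared-stack design described above: it swaps in Java's fork/join framework, where each worker has its own work-stealing deque, because that is shorter to show correctly. The class name, the `THRESHOLD` value, and the restriction to `int[]` are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

final class ParallelQuicksortSketch {
    // Ranges shorter than this are sorted sequentially (arbitrary, untuned value).
    private static final int THRESHOLD = 1 << 13;

    static final class SortTask extends RecursiveAction {
        private final int[] a;
        private final int lo, hi; // inclusive bounds

        SortTask(int[] a, int lo, int hi) {
            this.a = a;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo < THRESHOLD) {
                Arrays.sort(a, lo, hi + 1); // small range: sequential sort
                return;
            }
            int p = partition(a, lo, hi);
            // The two halves touch disjoint index ranges, so they can run in parallel.
            invokeAll(new SortTask(a, lo, p), new SortTask(a, p + 1, hi));
        }
    }

    // Hoare partition around the middle element.
    // Returns j such that every element of a[lo..j] is <= every element of a[j+1..hi].
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo - 1, j = hi + 1;
        while (true) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j) return j;
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    // Sorts the array in place; no second buffer of size n is allocated.
    static void sort(int[] a) {
        if (a.length > 1) {
            ForkJoinPool.commonPool().invoke(new SortTask(a, 0, a.length - 1));
        }
    }
}
```

Calling `ParallelQuicksortSketch.sort(data)` sorts `data` in place. The first partitioning pass is still sequential, which is one reason quicksort-style parallelism tends to flatten out as cores are added.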
I see you are looking to sort on a system with 8-32 cores. The PSRS algorithm avoids contention on a shared resource, which allows speedup to continue as more processors are added. I have demonstrated the algorithm on up to 4 hardware threads, as above, but experimental results from others report near-linear speedup with much larger numbers of cores, 32 and beyond. The disadvantage of the PSRS algorithm is that it is not in-place and will require considerably more memory.
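The usual PSRS phases (local sort, regular sampling, pivot selection, partition exchange, merge) can be sketched roughly as follows. This is a simplified illustration under several assumptions, not production code: the class name is made up, it handles only `int[]`, it uses parallel streams on the common pool rather than `p` dedicated threads, and the final merge is a naive linear scan over the `p` runs.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class PsrsSketch {

    // Returns a sorted copy of input, using p logical partitions.
    static int[] sort(int[] input, int p) {
        int n = input.length;
        int[] a = input.clone();              // extra memory: working copy
        if (p < 2 || n < p * p) {             // too small for regular sampling
            Arrays.sort(a);
            return a;
        }
        int[] start = new int[p + 1];         // chunk i is a[start[i] .. start[i+1])
        for (int i = 0; i <= p; i++) start[i] = (int) ((long) n * i / p);

        // Phase 1: sort each chunk independently, in parallel.
        IntStream.range(0, p).parallel()
                 .forEach(i -> Arrays.sort(a, start[i], start[i + 1]));

        // Phase 2: take p evenly spaced samples from each sorted chunk.
        int[] samples = new int[p * p];
        for (int i = 0; i < p; i++) {
            int len = start[i + 1] - start[i];
            for (int k = 0; k < p; k++) {
                samples[i * p + k] = a[start[i] + (int) ((long) len * k / p)];
            }
        }
        Arrays.sort(samples);
        int[] pivots = new int[p - 1];        // p-1 pivots chosen from the samples
        for (int k = 1; k < p; k++) pivots[k - 1] = samples[k * p + p / 2 - 1];

        // Phase 3: cut every sorted chunk into p runs at the pivots.
        int[][] cut = new int[p][p + 1];      // run j of chunk i is a[cut[i][j] .. cut[i][j+1])
        for (int i = 0; i < p; i++) {
            cut[i][0] = start[i];
            cut[i][p] = start[i + 1];
            for (int j = 1; j < p; j++) {
                cut[i][j] = upperBound(a, cut[i][j - 1], start[i + 1], pivots[j - 1]);
            }
        }
        int[] offset = new int[p + 1];        // where bucket j begins in the output
        for (int j = 0; j < p; j++) {
            int size = 0;
            for (int i = 0; i < p; i++) size += cut[i][j + 1] - cut[i][j];
            offset[j + 1] = offset[j] + size;
        }

        // Phase 4: bucket j merges run j of every chunk into its own output slice.
        int[] out = new int[n];               // extra memory: output buffer
        IntStream.range(0, p).parallel().forEach(j -> {
            int[] head = new int[p];
            for (int i = 0; i < p; i++) head[i] = cut[i][j];
            for (int w = offset[j]; w < offset[j + 1]; w++) {
                int best = -1;                // chunk whose current head is smallest
                for (int i = 0; i < p; i++) {
                    if (head[i] < cut[i][j + 1]
                            && (best < 0 || a[head[i]] < a[head[best]])) {
                        best = i;
                    }
                }
                out[w] = a[head[best]++];
            }
        });
        return out;
    }

    // First index in a[lo..hi) whose value is greater than key.
    static int upperBound(int[] a, int lo, int hi, int key) {
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

After phase 1, every parallel step reads and writes disjoint ranges, so there is no shared stack or queue to fight over. The price is visible in the two extra arrays of size n (`a` and `out`), which is the memory overhead mentioned above.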
If you are interested, you may use or peruse my Java code for each of these algorithms. You can find it on GitHub: https://github.com/broadbear/sort. The code is intended as a drop-in replacement for Java's Collections.sort(). If you are looking for the ability to perform parallel sorting in a JVM, as you state above, the code in my repo may help you out. The API is fully generic, for elements that implement Comparable or for use with your own Comparator.
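For reference, these are the two standard Collections.sort() call shapes that a drop-in replacement has to cover: natural ordering via Comparable, and an explicit Comparator. Only JDK calls appear here; the wrapper class and method names are made up for the example, and the repo's own entry points are deliberately not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class SortCallShapes {

    // Shape 1: elements implement Comparable (String does), so no comparator is needed.
    static List<String> naturalOrder(List<String> in) {
        List<String> copy = new ArrayList<>(in);
        Collections.sort(copy);
        return copy;
    }

    // Shape 2: the caller supplies a Comparator, here ordering by string length.
    static List<String> byLength(List<String> in) {
        List<String> copy = new ArrayList<>(in);
        Collections.sort(copy, Comparator.comparingInt(String::length));
        return copy;
    }
}
```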
May I ask what you are looking to sort that many elements for? I'm interested to know of potential applications for my sorting package.