3

I have two random access iterables: one containing the values and the other containing a permutation index. I want to sort the values and track the changes in the permutation index.

Sample input:

std::array values{7, 45, 18, 33, 77, 96, 83, 80, 4, 51};
std::array<int, values.size()> index;

Output:

values == {4, 7, 18, 33, 45, 51, 77, 80, 83, 96};
index == {8, 0, 2, 3, 1, 9, 4, 7, 6, 5};

The simple solution

One solution is to construct a vector of pairs containing the value and its index, sort this vector by the values and then extract the solution from this. But the original vector may be large so I would like to avoid converting it twice. I will nickname this solution vector pair sort.

Another solution

I can just std::sort the permutation index (which would contain 0, 1, 2... in the beginning) using values in the values iterable. This is described at How to obtain the index permutation after the sorting This would give me the permutation index but it won't sort the value iterable. I could do two things from here:

  1. std::sort() the values.

    I would have to sort the data twice. I would like to avoid that. The comparison operator might not be cheap. This solution will be double sort.

  2. Sort the values using the permutation index.

    I am not exactly sure how to do that. One solution is described here. This solution is boost index apply sort (because it is implemented by boost::algorithm::apply_permutation()). This modifies the permutation index so I have to make a copy before passing it to boost::algorithm::apply_permutation().

    A simple solution that can do it in place is described here. This is permutate in place sort.

    Another solution would be to convert the index into cycles and then apply it. I haven't implemented this.

The "best" solution

I think that the most logical approach is to sort the values normally, but when the sorting algorithm swaps two values, it should also swap corresponding indexed in the permutation index.

This would sort both iterables in place at once without any added cost.

But I have no idea how to implement this. If I would be implementing the sorting algorithm, I could just add a extra swap() and the issue would be solved. But I'd like to reuse std::sort(). It gives me a reasonably fast sorting algorithm without the need of writing it. I could also replace it with a different algorithm in the future by just replacing std::sort() with custom_sort() (assuming that custom_sort() has same signature which is a reasonable assumption).

I am interested in swap(). std::sort() is using this function to swap elements. My goal is to override this function and to make it swap the actual values and the indexes.

One way of doing that is creating a custom iterator that would combine iterators of values and indexes. This iterator would return a reference to a wrapper value upon calling operator* or operator[]. This wrapper value would have a custom swap().

Benchmarks

By the way I have done some benchmarks. You can find them and the implementations at https://github.com/meator/indexsort_impl I tried to sort a vector containing 1 000 000 random ints and doubles. The ints are ranging from INT_MIN to INT_MAX and the doubles are ranging from 0 to 1. I ran 200 samples. Here are the results:

function int benchmark double benchmark
value sort* 114.153 ms 127.824 ms
index sort using values* 177.985 ms 206.153 ms
vector pair sort 122.795 ms 139.467 ms
double sort 291.954 ms 339.121 ms
boost index apply sort 263.827 ms 298.844 ms
permutate in place sort 679.917 ms 708.676 ms

The "simple" vector pair sort is surprisingly good. Maybe the compiler is smart.

Question

Is my "best" solution with iterators viable? How could I implement that? Are there some more solutions that I have dismissed? Which is the best?


EDIT: My benchmarks contained a massive flaw which invalidated all results. I have fixed that. I also added raw sort benchmarks as per Paul Sanders'es suggestion. These of course aren't indexsort.

* As mentioned above, these aren't actual indexsort.

user13840624
  • 133
  • 8
  • 2
    The simple solution might be adapted: sort a [zip_view](https://en.cppreference.com/w/cpp/ranges/zip_view) (std C++23, use [ranges-v3](https://github.com/ericniebler/range-v3) for previous standard). – Jarod42 Jun 19 '23 at 20:56
  • How does `vector pair sort` compare (in terms of execution time) with just sorting the vector of values? – Paul Sanders Jun 19 '23 at 21:31
  • The pair solution is probably fast because it has good locality of reference; the index and value are right next to each other in memory, probably in the same cache line. – Mark Ransom Jun 20 '23 at 03:32
  • I have added simple sorting to the benchmarks. – user13840624 Jun 24 '23 at 18:47
  • @user13840624 most of the time cost is in the sorting. So whatever method that allows fast sorting will be the best. It's very likely `vector pair sort` is optimal. You can replace `std::sort` with `highway:: SIMD sort` or `pdqsort` on Github – Huy Le Jun 29 '23 at 03:26

2 Answers2

1

Your best solution should be doable. According to std::sort one signature until C++20 is

template< class RandomIt >
void sort( RandomIt first, RandomIt last );

and after it changes to

template< class RandomIt >
constexpr void sort( RandomIt first, RandomIt last );

where class RandomIt meets the requirement of both LegacyRandomAccessIterator and ValueSwappable. Which looks like the iterator can contain an index and pointers to both arrays, implement the right methods, and you have a swap function that takes 2 iterators, and swaps the contents of both functions.

And then you can use std::sort with no new large data structures.

btilly
  • 43,296
  • 3
  • 59
  • 88
  • I have tried to do that. But the iterator must be [_ValueSwappable_](https://en.cppreference.com/w/cpp/named_req/ValueSwappable), not [_Swappable_](https://en.cppreference.com/w/cpp/named_req/Swappable) (well it actually has to be _Swappable_ too, but that doesn't help me with indexsort). This means that the value has to be swappable. I not only have to make a custom wrapper iterator, but I also have to make a custom wrapper value. And I have to store this value somewhere. I can't just return it to the caller of `operator*` and `operator[]` because I must return a reference. – user13840624 Jun 24 '23 at 18:44
0

Your "best" solution is not going to be best, because it does O(N log N) value swaps when only O(N) are required.

I think the actually best option is to sort the indexes using the values, and then reorder the values in O(N) time using the permutation index.

Here's a simple implementation that temporarily inverts entries in the permutation index to mark which value positions we've sorted, and then fixes them all. When it's finished, both data and index are correct.

I stole the sort part from @SergeyKalinichenko's answer that you linked:

vector<int> index(data.size(), 0);
for (int i = 0 ; i != index.size() ; i++) {
    index[i] = i;
}
sort(index.begin(), index.end(),
    [&](const int& a, const int& b) {
        return (data[a] < data[b]);
    }
);
for (int target = 0; target < index.size(); ++target) {
    int src = index[target];
    if (src < 0) {
        // did this one already
        index[target] = ~src;
        continue;
    }
    if (src == target) {
        //already in the right place
        continue;
    }
    auto save = data[target];
    int t2 = target;
    do {
        data[t2] = data[src];
        t2 = src;
        src = index[t2];
        index[t2] = ~src; // mark t2 position as done. invar: t2 > target
    } while(src != target);
    data[t2] = save;
}
Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87