Find the unique sorted values of a vector and also indexes of the sorted vector to regenerate the original vector

Question

Suppose we have a vector like

a = {2,2,2,1,7,7,7,5,5}

I wish to get a sorted vector of unique elements which in this case gives

b = {1,2,5,7}

I also wish to get a vector of indexes of this sorted vector that reconstructs the initial vector:

c = {1,1,1,0,3,3,3,2,2}

Here b[x] for each element x in c returns the original vector a.

{ b[1],b[1],b[1],b[0],b[3],b[3],b[3],b[2],b[2] } = { 2,2,2,1,7,7,7,5,5 }

I wish to write an efficient algorithm in C++ to give me vectors b and c, given a vector a.

I have implemented a brute force solution below:

    std::vector<int> vector_a{ 2,2,2,1,7,7,7,5,5 };

    std::vector<int> vector_b(vector_a); 
    std::sort(vector_b.begin(), vector_b.end());
    std::vector<int>::iterator ip;

    ip = std::unique(vector_b.begin(), vector_b.end());
    vector_b.resize(std::distance(vector_b.begin(), ip));

    std::vector<int> vector_c;

    for (int i = 0; i < vector_a.size(); ++i) {
        for (int j = 0; j < vector_b.size(); ++j) {
            if (vector_a[i] == vector_b[j]) {
                vector_c.push_back(j);
            }
        }
    }

Is there a way I can implement a more efficient solution with complexity lesser than O(m*n)?

Sounds like an interesting project. What is your actual question? — Nathan Pierson, Mar 17 '22 at 07:55
what did you try? You'd be surprised how often searching the words helps to find the right algorithms. It is `std::sort` and `std::unique`. Only to get the indices a bit extra work is needed — 463035818_is_not_an_ai, Mar 17 '22 at 07:59
@463035818_is_not_a_number I have implemented a brute force solution using std::sort and std::unique. I have added that that to the question. — Shakeel Hassan, Mar 17 '22 at 08:33
@NathanPierson I am trying to implement fuzzy image contrast enhancement — Shakeel Hassan, Mar 17 '22 at 08:34
ok now we are just missing a question. Whats wrong with the solution you have already? — 463035818_is_not_an_ai, Mar 17 '22 at 08:36
I feel my solution is inefficient. Since I am working with images with typical resolution of 1024*1024, my vector_a has around 1 million values and vector_b has 256 possible pixel values(0-255). Hence I was wondering if there is a solution better than the bruteforce one I have implemented. — Shakeel Hassan, Mar 17 '22 at 08:49

Serge Ballesta · Answer 1 · 2022-03-17T10:53:51.790

I would use a map of vectors or lists. The index would be the unique value, and the value would be the list of indices in the original vector.

It can be built in one single pass (complexity O(n)) and can in turn build the b and c vectors in one single pass (again complexity O(n))

Possible code:

#include <iostream>
#include <map>
#include <vector>

int main()
{
    std::vector<int> a { 2,2,2,1,7,7,7,5,5 };
    std::map<int, std::vector<int> > m;
    for (int i = 0; i < a.size(); i++) {
        m[a[i]].push_back(i);
    }
    std::vector<int> b(m.size()), c(a.size());
    int i = 0;
    for (auto& it : m) {
        for (int j : it.second) {
            c[j] = i;
        }
        b[i++] = it.first;
    }
    std::cout << "b: " << b[0];
    for (int i = 1; i < b.size(); i++) std::cout << ", " << b[i];
    std::cout << "\nc: " << c[0];
    for (int i = 1; i < c.size(); i++) std::cout << ", " << c[i];
    std::cout << std::endl;
    return 0;
}

It gives as expected:

b: 1, 2, 5, 7
c: 1, 1, 1, 0, 3, 3, 3, 2, 2

After a second thought, the complexity should be slightly more than O(n) because inserting or searching in a std::map is not constant but is O(log(n)). But is should still be at most O(n*log(n))...

Thank you for your clear explanation and awesome solution. I have implemented it and It resulted in a big speedup for my project. — Shakeel Hassan, Mar 17 '22 at 11:52
As indicated by OP in a comment, the number of values if less than `M = 256`, independent of `n`. Insertion in the map should be `O(log M) = O(1)`. So your method is indeed O(N). — Damien, Mar 17 '22 at 14:22

Artyer · Answer 2 · 2022-03-17T09:30:40.720

What you want is what numpy.unique(a, return_inverse=True) returns in Python. I've found similar questions here: C++ equivalent of numpy.unique on std::vector with return_index and return_inverse, Find the indices and inverse mapping of a unique vector.

Here's a strategy on how to accomplish this:

Generate [0, a.size()) in a vector, i
Sort i based on the value in vector a at the given index
Find the unique values by iterating over i and finding the indices whose item in a are different from the index before (including the first). This is similar to the std::unique algorithm.
Find the "inverse indices" by iterating over i and assigning each index of the inverses by a counter that increments when there is a new unique value. This incrementing counter corresponds exactly to indices in the unique vector.

(The last two steps can be done together).

Here is an example:

template<typename T>
struct unique_indices {
    std::vector<T> uniques;
    std::vector<std::size_t> inverse_indices;
};

template<typename T, typename Cmp = std::less<T>>
unique_indices<T> sort_unique(const std::vector<T>& v, Cmp&& cmp = {}) {
    unique_indices<T> result;

    // generate indices
    std::vector<std::size_t> sorted_indices(v.size());
    std::iota(sorted_indices.begin(), sorted_indices.end(), std::size_t{0});

    // sort based on indices
    // (Could want std::stable_sort here if you care about which exact
    //  item is copied into `unique`, but doesn't matter for `int` with `std::less`)
    std::sort(sorted_indices.begin(), sorted_indices.end(), [&](std::size_t i, std::size_t j) {
        return cmp(v[i], v[j]);
    });

    // get unique values + set inverse indices
    // (similar to std::unique algorithm)
    auto first = sorted_indices.begin();
    auto last = sorted_indices.end();
    if (first == last) return result;  // v was empty

    result.inverse_indices.resize(v.size());

    std::size_t unique_index = 0;
    result.uniques.push_back(v[*first]);
    result.inverse_indices[*first] = unique_index;

    while (++first != last) {
        if (cmp(result.uniques.back(), v[*first])) {
            // new value
            ++unique_index;
            result.uniques.push_back(v[*first]);
        }
        result.inverse_indices[*first] = unique_index;
    }

    return result;
}

std::vector<int> vector_a = {2,2,2,1,7,7,7,5,5};
auto [vector_b, vector_c] = sort_unique(vector_a);
// vector_b == {1,2,5,7}
// vector_c == {1,1,1,0,3,3,3,2,2}

O(N lg N) runtime and O(N) extra space needed.
Your original solution was O(N lg N + N M) time also with O(N) extra space.

(Where N = a.size() and M is the number of unique values in a)

Thank you for your solution. I tried it and it worked. However I found it a bit difficult to understand as my main language is python and I am porting code from python to c++. To understand my code better in the future, I have gone ahead with the answer provided by @Serge Ballesta. Appreciate your help. — Shakeel Hassan, Mar 17 '22 at 11:55

Damien · Answer 3 · 2022-03-17T13:41:31.593

You indicated in a comment that in your use case, all values are in between [0, 255].

This allows to implement a variant of counting sort. Three steps:

Mark which values are used (array values): values[a[i] = 1.
looping over array values, add relevant ones in the vector unique, and at the same time associate each value to the corresponding index of the vector
Looping over the input array a[], directly find the corresponding index values[a[i]], and add this index in the vector indexes.

Total complexity: O(n)

#include <iostream>
#include <vector>
#include <array>
#include <utility>

#define NMAX 256

void print (const std::vector<int> &v, const std::string& mes = "") {
        std::cout << mes;
        for (int i: v) {
                std::cout << i << " ";
        }
        std::cout << "\n";
}

std::pair<std::vector<int>, std::vector<int>> sort_unique (const std::vector<int> &a) {
    int n = a.size();
    std::vector<int> unique, indexes (n);
    std::array<int, NMAX> values;
    values.fill (0);
    unique.reserve (NMAX);
    for (auto& x: a) {
        values[x] = 1;
    }
    int cpt = 0;
    for (int i = 0; i < NMAX; ++i) {
        if (values[i] == 0) continue;
        values[i] = cpt++;
        unique.push_back(i);
    }
    for (int i = 0; i < n; ++i) {
        indexes[i] = values[a[i]];
    }
    return {unique, indexes};
}

int main() {
    std::vector<int> a = {2,2,2,1,7,7,7,5,5};
    auto [unique, indexes] = sort_unique (a);
    print (a, "input: ");
    print (unique, "unique values: ");
    print (indexes, "indexes: ");
    return 0;
}

Find the unique sorted values of a vector and also indexes of the sorted vector to regenerate the original vector

3 Answers3