Have many vectors sorted together

Question

I have three vectors of the same size (~ 1 million items):

std::vector<wstring> name;
std::vector<int> x;
std::vector<int> y;

which can be seen as three "columns".

How to sort A->Z the vector name:

std::sort(name.begin(), name.end())

but having the vectors x and y sorted accordingly?

Example:

name  x  y                 name  x  y
BCD   7  9                 ABC   4  3
ZYX   1  4        =>       BCD   7  9
ABC   4  3                 ZYX   1  4

The good thing about using a std::vector, is that I can easily select/filter a few items in the big vector by taking just a list of index to keep (example: let's keep items 12, 1872, 2834, 1831). I thought about using a std::map but I fear it won't be as efficient for this: how to keep a list of elements to keep in a map?

@Nemo this question looks complicated with Boost (that I don't use), the general wording is more complex (with an answer using Boost included at the end of the question...), and moreover the accepted answer uses https://github.com/ericniebler/range-v3, requires C++14, etc. Are you sure the question you point is really cleaner than this one? Even the title about STL is probably obvious for a confirmed user, but most intermediate people will only see Boost. Maybe this other question would require some cleaning, in order to be a good "master" question for this "slave" one, don't you think so? — Basj, Jul 12 '17 at 23:47
Well, the question is certainly an exact duplicate... But I agree some of these answers are useful. Unfortunately, I know of no way to represent this sort of case on SO (and it is not the first time I have seen it). I have added a link back here in a comment. You are free to add an actual answer over there if you like; I would even upvote it — Nemo, Jul 13 '17 at 16:21

Justin · Accepted Answer · 2017-07-12T23:51:19.800

4

There are a couple possible ways to do this. The easiest way is to wrap name, x, and y in a struct:

struct Person {
    std::wstring name;
    int x;
    int y;
};

Then you can have a std::vector<Person> people and sorting it would be (assuming C++14)

std::sort(people.begin(), people.end(),
    [](auto const& lhs, auto const& rhs) { return lhs.name < rhs.name; });

However, if you know that this will cause performance problems due to fewer elements fitting in the cache (that is, you'd frequently iterate over only x or only y and you are in a very constrained environment such as high performance gaming), I'd suggest only sorting one vector. Unless you know what you're doing, you'd need to benchmark both options.

Basically, have a vector that keeps track of the ordering:

std::vector<std::wstring> name;
std::vector<int> x;
std::vector<int> y

std::vector<std::size_t> ordering(name.size());
std::iota(ordering.begin(), ordering.end(), 0);

std::sort(ordering.begin(), ordering.end(),
    [&](auto const& lhs, auto const& rhs) {
        return name[lhs] < name[rhs];
    });

Then you can simply iterate over ordering to go through each parallel vector in the new order.

It's possible that the extra level of indirection will make it less efficient. For example, the CPU might think that there's a data dependency where there is none. Furthermore, the extra data we are keeping track of in ordering could easily take enough room in the cache to counteract the benefit of separating name, x, and y; you'd need to know the specifications of your target architecture and profile to be sure.

If you would want to keep iterating over them in this new order, you would want to use this ordering vector to sort the other vectors, because the access to the elements would become random. That would counteract the benefit of keeping the vectors separate (unless the vectors are small enough to fit in the cache anyway).

The easiest way to do that would be to create a new vector:

std::vector<std::wstring> newNames;
newNames.reserve(name.size());

for (auto i : ordering) {
    newNames.push_back(name[i]);
}

Reconstructing the vectors like this is probably what you want to do if the sorting happens during initialization.

edited Jul 12 '17 at 23:51

answered Jul 12 '17 at 23:02

Justin

24,288
12
92
142

Thanks for this answer! Just to be sure, why the & in `std::sort(ordering.begin(), ordering.end(), [&]...`? – Basj Jul 12 '17 at 23:16
@Basj to pass the vectors by reference into the lambda – yizzlez Jul 12 '17 at 23:18
@Basj The `[...](...){...}` syntax is a lambda. Inside the `[]` you declare what variables you capture. Using a `[=]` means you capture everything you need by value. Using a `[&]` means you capture everything you need by reference. So what this does is it captures `name` by reference, as we need to use it in the body of the lambda itself – Justin Jul 12 '17 at 23:18
@Justin Thanks. But why `[]` in the first code and `[&]` for the second? – Basj Jul 12 '17 at 23:19
Also, for non (not-so)-new-c++-familiar-people like me, how would this translate without `auto` @Justin? – Basj Jul 12 '17 at 23:20
@Basj If you can't use `auto` you also can't use the lambda. You have to make a separate function or function-object. Minor, but non-trivial modification. For more on Lambda, read here: http://en.cppreference.com/w/cpp/language/lambda – user4581301 Jul 12 '17 at 23:22
1

@Basj If you are using C++11, you have to write out the type you are using in the comparators. So that would be `std::wstring` in the first case, and `std::size_t` in the second case. – Justin Jul 12 '17 at 23:23
@Basj In the first case, we don't need to capture anything. The thing we want to sort by is part of the objects we are sorting themselves, so we can get the data from the parameters passed in. In the second case, we want to sort by something external, so we need to capture it – Justin Jul 12 '17 at 23:24
@Justin for better understanding, can I edit (or you) and replace auto by the relevant types? It would help to focus on the actual types of the data which are compared in the lambda. – Basj Jul 12 '17 at 23:29
@Basj http://coliru.stacked-crooked.com/a/19c88f69b569c3b8 . I did make a typo, it's supposed to be a `Person` in the first case – Justin Jul 12 '17 at 23:33
2

The second option adds another vector (extra space), plus an indirection for every access, plus it makes sequential access into random when iterating... For all of these reasons, it could wind up being slower than the first option. It is also considerably more complex. I would strongly suggest benchmarking both options before going with the second. – Nemo Jul 12 '17 at 23:41
Last quesiton @Justin about the 2nd solution: why pass a pointer / reference to a size_t whereas passing a size_t itself would be even easier. (i.e. why dealing on reference of an int, and not pass the actual value)? Wouldn't there be a way without all the & ? – Basj Jul 12 '17 at 23:43
@Basj For the comparators, that's because I've gotten in the habit of usually taking parameters like that by `auto const&`, so that if I end up using something that's bigger than a primitive, there'll be no expensive copy. The compiler will optimize the comparator and change it to take the parameters by value if it finds that that would be more efficient (and indeed it probably would) – Justin Jul 12 '17 at 23:46
@Justin thanks. Would the by value version be `std::sort(ordering.begin(), ordering.end(), [](std::size_t lhs, std::size_t rhs) { return name[lhs] < name[rhs]; });` or it would still require a & (or something else) somewhere ? – Basj Jul 12 '17 at 23:53
1

@Basj That would yield a compiler error due to the compiler not knowing what `name` is; the by value version would be `std::sort(ordering.begin(), ordering.end(), [&](std::size_t lhs, std::size_t rhs) { return name[lhs] < name[rhs]; });` – Justin Jul 12 '17 at 23:54

score 1 · Answer 2 · answered Jul 12 '17 at 23:13

It sounds like you want a struct to keep the data together. For example:

struct MyData
{
  wstring name;
  int x;
  int y;
};

...
std::vector<MyData> data;

From there, you'll want a comparison function to do a custom sort that ensures you are sorting from the field you want to sort by:

std::sort(data.begin(), data.end(), compareByName);

bool compareByName(const MyData& lhs, const MyData& rhs)
{
    return lhs.name < rhs.name; // This can be whatever
}

Have many vectors sorted together

2 Answers2

Linked