Is there a safe (defined behaviour) way to use the STL to reduce the boilerplate of efficiently filtering a vector based off its indices?

Question

I very often find myself wanting to filter a vector based on its index rather than its values.

auto some_values = std::vector{1, 0, 4, 6, 2};

// Somewhere I figure out which items to remove.
// This may be from user interaction or something
// nothing to do with the original values:
const auto removing = std::vector<bool>{0, 0, 1, 0, 1};

So, I'm tempted to use erase_if like so:

std::erase_if(some_values, [it = removing.begin()](auto) mutable {return *it++;});

It seems to works for gcc and clang. However, there doesn't seem to be anything on the std::erase cppref page about the order in which the predicate is called, so I assume this is undefined behaviour?

Same issue with std::remove_if. Note that zipping the ranges won't work for most zipping options as usually the resultant range can't resize the underlying data.

Using a for loop and creating a copy of the array isn't too much boilerplate, but I'm currently applying filters to some low-ish level code where I can't just copy all the data. Some of the data will be huge and needs to be responsive during these filter operations.

Worst-case scenario I can add functions like this to solve the problem:

template <class T, auto N, class Pred>
size_t RemoveIfIndex(std::span<T, N> span, Pred&& pred)
{
  size_t write = 0; // Also behaves as a counter
  
  // Initialise to the first deleted index:
  for (; (write < span.size()) && !pred(write); ++write);
  
  // Shuffle the elements:
  for (size_t read = write + 1; read < span.size(); ++read)
  {
    if (!pred(read))
    {
      span[write] = span[read];
      ++write;
    }
  }
  return write;
}

template <class T, class Alloc, class Pred>
size_t EraseIfIndex(std::vector<T, Alloc>& vec, Pred&& pred)
{
  const size_t new_size = RemoveIfIndex(std::span{vec.begin(), vec.end()}, pred);
  vec.resize(new_size);
  return new_size;
}

But I'm hesitant to add more code to our low-level library when there's similar categories of functions I'll need to add (that need index info), and I suspect there's some insight into ranges/adaptors/filters that I'm missing that'd make these functions unnecessary.

It's indeed not guaranteed to work, but FWIW clang, gcc and msvc all have the exact same implementation you have. It's hard to have a different one, considering the standard requires exactly the same number of predicate evaluations and elements. — Passer By, Apr 21 '23 at 05:44
Shouldn't the second parameter of `EraseIfIndex` be an array holding `bool`s instead of a function? — 康桓瑋, Apr 21 '23 at 05:44
@康桓瑋, I'd use it like [this](https://godbolt.org/z/5Mevnj8xT). That way it works if you're dependent on other data in other ways (including co-dependent on the original data). — Elliott, Apr 21 '23 at 05:54
The index of `x` in `vx` is `&x - &vx[0]`, guaranteed. Why rely on something that is not guaranteed? (Actually addressof and not &). — n. m. could be an AI, Apr 21 '23 at 06:51
Have a look at the proposed `std::hive` which solves this problem run-time efficiently (at the cost of more space) — davidhigh, Apr 21 '23 at 08:10
@n.m. Algorithms aren't guaranteed to pass in the original object. The predicate is specifically required to evaluate to the same thing with any equivalent object. — Passer By, Apr 21 '23 at 09:20
So you just need to discard some elements at specific indice, right? Because if that is the case, I have a ridiculously simple solution. — Red.Wave, Apr 21 '23 at 10:04
@n.m. You know, I'm not so sure after checking around. That was something I read a long time ago. [This](https://timsong-cpp.github.io/cppwp/algorithms.parallel.user#1) explicitly states it for parallel algorithms, but there's only a lack of guarantee for sequential algorithms. — Passer By, Apr 21 '23 at 10:29
@ildjarn That's about the predicate object, not the arguments. And notes aren't normative anyways. — Passer By, Apr 23 '23 at 02:19
@PasserBy : *Foot*notes aren't normative; this is a normative clarification clause. Best to know the difference if you're looking for standardese. And I guess I misunderstood (and mis-upvoted) you if you weren't referring to the predicates themselves, sorry. — ildjarn, Apr 23 '23 at 05:14
@ildjarn [Notes are not normative](https://stackoverflow.com/questions/21364398/are-notes-and-examples-in-the-core-language-specification-of-the-c-standard-no). — Passer By, Apr 23 '23 at 07:07

Jerry Coffin · Answer 1 · 2023-04-21T16:44:15.897

It may or may not qualify as what you think of "STL", but std::valarray supports mask arrays like this a little more generally, so you can do something like this:

#include <valarray>
#include <iostream>

int main() {
    std::valarray<int> values { 1, 2, 3, 4, 5 };
    std::valarray<bool> mask  { 0, 1, 1, 0, 1 };

    // zero all the values where there's a 1 in the mask array
    values[mask] = 0;

    // sum where there were zeros in the mask array:
    std::cout << values.sum(); // result: 5 (1+4)
}

This can also work with the contents of the array. For example, you can specify a condition (where the name of the valarray acts as kind of a placeholder):

#include <valarray>
#include <iostream>

int main() {
    std::valarray<int> values { 1, 2, 3, 4, 5 };

    values[values > 3] = 0;
    // values = { 1, 2, 3, 0, 0 };
    std::cout << values.sum(); // 6 (1+2+3)
}

So it's not usually for removing elements like you're talking about, but depending on the situation, you may be able to get a similar effect, basically ignoring the elements you don't need/want.

Disclaimer: mostly posting this as something interesting that few people know about. Honestly, it's open to a lot of question how useful it really is--valarray can do some kind of cool things, but it also has some rather strange limitations. For example, from looking at the preceding code, it might initially seem like you should be able to do something like this:

#include <valarray>
#include <iostream>

int main() {
    std::valarray<int> values { 1, 2, 3, 4, 5 };
    std::valarray<bool> mask  { 0, 1, 1, 0, 1 };

    std::cout << values[mask].sum();
}

...but no, that won't work--it'll normally give an error to the effect that std::mask_array doesn't have a member function named sum().

There are pretty good reasons valarray is so rarely used.

score 1 · Answer 2 · answered Apr 28 '23 at 05:13

1

Simple C++

std::erase_if(some_values, 
              [&removing, beginPtr = &*some_values.cbegin()](const int& i) 
              {
                return removing[&i - beginPtr];
              });

Using boost

#include <boost/tuple/tuple.hpp>
#include <boost/range/combine.hpp>
#include <boost/range/algorithm/remove_if.hpp>

const auto it = boost::remove_if(
    boost::combine(some_values, removing),
    [](const auto& Elem)
    {
        return boost::get<1>(Elem);
    });
some_values.erase(
    boost::get<0>(it.get_iterator_tuple()),
    some_values.end());

Using std::ranges

See https://stackoverflow.com/a/67845735/266742

answered Apr 28 '23 at 05:13

RyanCu

516
4
12

`std::erase_if` provides no guarantees that the address of the value passed to the predicate is unchanged. I haven't looked into the boost APIs, but I think the "Simple C++" example has undefined behaviour. Would you agree? – Elliott Apr 28 '23 at 07:06
No, I think it's defined. The standard defines `erase_if` with reference to `remove_if` [vector.erasure], and it requires that `remove_if` apply a dereferenced iterator to the predicate. [alg.remove-1.2] and https://eel.is/c++draft/algorithms.requirements#6 – RyanCu Apr 28 '23 at 12:24

score 0 · Answer 3 · answered Apr 21 '23 at 08:04

You are right that your erase_if call is not guaranteed to work.

Depending on what you need the filtered elements for, it might be viable to not erase them but rather build a table to look them up in the original vector. Something along the line of ...

#include <vector>
#include <iostream>

template <typename T>
struct filtered_vector {
    std::vector<T>& data;
    std::vector<size_t> indices;
    size_t s;
    filtered_vector(std::vector<T>& data,const std::vector<bool>& removing) : data(data) {
        size_t count = 0;
        for (const auto& r : removing) {
            if (!r) indices.push_back(count);
            ++count;
        }
    }
    T& operator[](size_t index) {
        return data[indices[index]];
    }
    size_t size() { return indices.size();}
};

int main() {
    auto some_values = std::vector{1, 0, 4, 6, 2};
    const auto removing = std::vector<bool>{0, 0, 1, 0, 1};
    filtered_vector<int> f{some_values,removing};
    for (size_t i=0;i<f.size();++i) std::cout << f[i] << " ";
}

As a sidenote, if you do erase and insert elements often but at the same time require stable references to other elements, it might be worth to consider std::list with stable iterators.

Pepijn Kramer · Answer 4 · 2023-04-21T06:47:01.547

== warning incorrect example, left for educational purpose == (remove_if moves elements around in memory so indices will not line up during iteration)

Why worst case? Having a working function is a good way to work. It is not always about the minimum lines of code. I would create a function anyway to check for boundary conditions (e.g. vector sizes should match)

#include <algorithm>
#include <iostream>
#include <vector>
#include <stdexcept>

void remove_by_index(std::vector<int>& values, const std::vector<bool>& filter)
{
    if (values.size() != filter.size()) throw std::invalid_argument("");
    std::size_t index{ 0 };
    auto it = std::remove_if(values.begin(), values.end(), [&](const auto) { return filter[index++]; });
    values.erase(it, values.end());
}

int main()
{
    auto some_values = std::vector{ 1, 0, 4, 6, 2 };
    const auto remove = std::vector<bool>{ 0, 0, 1, 0, 1 };

    remove_by_index(some_values, remove);

    for (const auto value : some_values)
    {
        std::cout << value << " ";
    }

    return 0;
}

This is UB, the lambda will get different results every time it is called. — 康桓瑋, Apr 21 '23 at 06:09
You are right it is UB, But is not the lambda that's the problem, it is the remove if, that starts shifting elements of values in memory and the items will not be at the expected indices anymore. So all in all, I'd probably go for a copy_if approach — Pepijn Kramer, Apr 21 '23 at 06:45

Is there a safe (defined behaviour) way to use the STL to reduce the boilerplate of efficiently filtering a vector based off its indices?

4 Answers4

Simple C++

Using boost

Using std::ranges