
What is the best way to delete elements from a vector, given another vector?

I have come up with the following code:

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    if(!vDestination.empty() && !vSource.empty())
    {
        for(auto i: vSource) {
            vDestination.erase(std::remove(vDestination.begin(), vDestination.end(), i), vDestination.end());
        }
    }
}

int main() 
{
    vector<int> v1={1,2,3};
    vector<int> v2={4,5,6};
    vector<int> v3={1,2,3,4,5,6,7,8,9};
    remove_elements(v3,v1);
    remove_elements(v3,v2);
    for(auto i:v3)
        cout << i << endl;
    return 0;
}

Here the output will be:

7
8
9
user3898160
  • assuming that vectors will be large enough to justify even bothering with the optimizations, I would first transform `vDestination` to a `std::list` (or list of smart pointers?) to avoid costly removal from vector (because its storage has to be contiguous) and back to `std::vector` at the end – slawekwin Aug 25 '16 at 07:44
  • @slawekwin: I strongly doubt that this will be faster for any realistic size of the vector. Lists require additional redirections and cannot be cached efficiently. (I am aware that element removal is O(1) for lists and O(N) for vectors.) – Frank Puffer Aug 25 '16 at 07:54
  • @Frank I am not sure about it as well, but I believe it would be an interesting experiment to measure it :) – slawekwin Aug 25 '16 at 07:57
  • Are all the vectors sorted? – Bob__ Aug 25 '16 at 08:00
  • might be worth checking out if creating an `unordered_set` (from `vSource`) first helps with efficiency. Theoretically it should change the worst case from `N * M` to `N + M` (assuming `N` is `vDestination` size and `M` is `vSource` size - one lookup per `vDestination` element instead of up to `M` lookups + set creation), but it depends on bucket configuration. Plus, you didn't mention what happens with duplicates - should they be removed once, or all of them? – Tomasz Lewowski Aug 25 '16 at 08:29
  • @Bob__: No, the vectors are not sorted; however, in the example they are. – user3898160 Aug 25 '16 at 08:55

4 Answers


My version is the following: I only apply erase once, after all elements from the vector vSource have been moved to the end by std::remove, and I keep track of an iterator to the end of the not-yet-removed range of vDestination so that the already-compacted tail is not iterated over for nothing.

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    auto last = std::end(vDestination);
    std::for_each(std::begin(vSource), std::end(vSource), [&](const int & val) {
        last = std::remove(std::begin(vDestination), last, val);
    });
    vDestination.erase(last, std::end(vDestination));
}

See it on coliru: http://coliru.stacked-crooked.com/a/6e86893babb6759c


Update

Here is a template version, so you don't have to care about the container type:

template <class ContainerA, class ContainerB>
void remove_elements(ContainerA & vDestination, const ContainerB & vSource) 
{
    auto last = std::end(vDestination);
    std::for_each(std::begin(vSource), std::end(vSource), [&](typename ContainerB::const_reference val) {
        last = std::remove(std::begin(vDestination), last, val);
    });
    vDestination.erase(last, std::end(vDestination));
}

Note

This version works for unsorted vectors without any constraints. If your vectors are sorted, you can take some shortcuts and avoid iterating over the destination vector again and again to delete each element.
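
If the vectors are not sorted but vSource is large, another possible shortcut (suggested in the comments on the question) is to build a std::unordered_set from vSource once and do a single std::remove_if pass over vDestination. A minimal sketch of that idea (the function name is illustrative):

#include <algorithm>
#include <unordered_set>
#include <vector>

// Hash-set lookup instead of scanning vSource for every element of vDestination.
// Preserves the order of the kept elements; expected complexity O(N + M).
void remove_elements_hashed(std::vector<int>& vDestination, const std::vector<int>& vSource)
{
    const std::unordered_set<int> toRemove(vSource.begin(), vSource.end());
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(),
                       [&](int e) { return toRemove.count(e) > 0; }),
        vDestination.end());
}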

dkg
  • Or with [this](http://coliru.stacked-crooked.com/a/df3f23f7aa6f7f54) "STL-elegant" psychedelics. – ilotXXI Aug 25 '16 at 08:09
  • @ilotXXI not sure if "remove_if" is necessary here. Considering "remove" does kind of the same for all elements with the same value. – Hayt Aug 25 '16 at 08:20
  • @dkg: Thanks for the answer. How did you check that this is more efficient? – user3898160 Aug 25 '16 at 08:21
  • @Hayt, what do you mean? It iterates different values, not the same. – ilotXXI Aug 25 '16 at 08:38
  • ah you are right. sorry I kind of missed you switched the inner/outer "loop". Would be interesting to see if there are performance changes. I have no access to a compiler at the moment though. I can also see, for "larger" vectors, putting the source vector values into an unordered_set and just doing a lookup to see if they are there, to avoid repeated iteration over the source vector. – Hayt Aug 25 '16 at 08:41

If your vectors are always sorted, you can use set_difference:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>

void remove_elements(std::vector<int>& vDestination, const std::vector<int>& vSource) 
{
    std::vector<int> result;
    std::set_difference(vDestination.begin(), vDestination.end(), vSource.begin(), vSource.end(), std::back_inserter(result));
    vDestination.swap(result);
}

int main() 
{
    std::vector<int> v1={1,2,3};
    std::vector<int> v2={4,5,6};
    std::vector<int> v3={1,2,3,4,5,6,7,8,9};
    remove_elements(v3,v1);
    remove_elements(v3,v2);
    for(auto i:v3)
        std::cout << i << '\n';
}

If not for the requirement that the output range must not overlap with either input range, we could even avoid the additional vector. You could potentially roll your own version of set_difference that is allowed to write to a range starting at vDestination.begin(), but that is outside the scope of this answer.
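
For illustration, a hand-rolled in-place variant along those lines could look like the following sketch (both vectors sorted, same duplicate semantics as std::set_difference; the function name is illustrative):

#include <vector>

// Writes the kept elements directly over the front of vDestination, then erases the tail.
// Both inputs must be sorted; each occurrence in vSource cancels at most one matching
// occurrence in vDestination, just as std::set_difference would.
void remove_elements_inplace(std::vector<int>& vDestination, const std::vector<int>& vSource)
{
    auto out = vDestination.begin();      // next write position for a kept element
    auto src = vSource.begin();
    for (auto it = vDestination.begin(); it != vDestination.end(); ++it) {
        while (src != vSource.end() && *src < *it)
            ++src;                        // skip source values smaller than the current element
        if (src != vSource.end() && !(*it < *src))
            ++src;                        // *it == *src: drop this one occurrence
        else
            *out++ = *it;                 // not matched: keep it
    }
    vDestination.erase(out, vDestination.end());
}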

Revolver_Ocelot

I assume that by best you mean the fastest one that works. Since it's a question about efficiency, I performed a simple benchmark to compare the efficiency of several algorithms. Note that they differ a little, since the problem is a bit underspecified - the questions that arise (and the assumptions taken for the benchmark) are:

  • is it guaranteed that vDestination contains all elements from vSource? (assumption: no)
  • are duplicates allowed in either vDestination or vSource? (assumption: yes, in both)
  • does the order of the elements in the result vector matter? (algorithms for both cases tested)
  • should every element from vDestination be removed if it is equal to any element from vSource, or only one-for-one? (the tested algorithms differ on this point)
  • are the sizes of vDestination and vSource somehow bounded? Is one of them always bigger or much bigger? (several cases tested)
  • in the comments it's already explained that the vectors don't need to be sorted, but I've included this point, as it's not immediately visible from the question (no sorting assumed in either vector)

As you can see, there are a few points in which the algorithms will differ and consequently, as you can guess, the best algorithm will depend on your use case. The compared algorithms include:

  1. the original one (proposed in the question) - baseline
  2. the one proposed in @dkg's answer
  3. the one proposed in @Revolver_Ocelot's answer + additional sorting (required by the algorithm) and pre-reservation of space for the result vector
  4. the one proposed in @Jarod42's answer
  5. a set-based algorithm (presented below - mostly an optimization of @Jarod42's algorithm)
  6. a counting algorithm (presented below)

set-based algorithm:

// Requires <unordered_set> and <utility> (std::swap); 'destination' and 'vSource' are the two vectors.
std::unordered_set<int> elems(vSource.begin(), vSource.end());
auto i = destination.begin();
auto target = destination.end();
// Invariant: [begin, i) holds kept elements, [target, end) holds elements to be erased.
while(i < target) {                      // '<' rather than '<=', so *i is never read past the valid range
    if(elems.count(*i) > 0)
        std::swap(*i, *(--target));      // move the matching element into the tail; element order is not preserved
    else
        ++i;
}
destination.erase(target, destination.end());

counting algorithm:

// Requires <unordered_map>, <algorithm> and <iterator>; 'destination' and 'source' are the two vectors.
std::unordered_map<int, int> counts;
counts.max_load_factor(0.3);
counts.reserve(destination.size());

// Count occurrences in the destination, then subtract one per occurrence in the source
// (duplicates in source cancel duplicates in destination one-for-one).
for(auto v: destination) {
    counts[v]++;
}

for(auto v: source) {
    counts[v]--;
}

// Rewrite the surviving values over the front of the vector; the original element order is not preserved.
auto i = destination.begin();
for(auto k: counts) {
    if(k.second < 1) continue;
    i = std::fill_n(i, k.second, k.first);
}
destination.resize(std::distance(destination.begin(), i));

The benchmarking procedure was executed using the Celero library and was as follows (a plain std::chrono sketch of these steps is shown after the list):

  1. Generate n pseudo-random ints (n in the set {10, 100, 1000, 10000, 20000, 200000}) and put them into a vector
  2. Copy a fraction (m) of these ints to a second vector (fractions from the set {0.01, 0.1, 0.2, 0.4, 0.6, 0.8}, min. 1 element)
  3. Start timer
  4. Execute removal procedure
  5. Stop timer
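
This is not the Celero harness used for the published numbers (that code is in the linked repository); the following is only a minimal std::chrono sketch of steps 1-5, with Jarod42's remove_if version standing in for whichever algorithm is being compared:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// The function under test: any implementation from this page can be dropped in instead.
void remove_elements(std::vector<int>& vDestination, const std::vector<int>& vSource)
{
    const auto isInSource = [&](int e) {
        return std::find(vSource.begin(), vSource.end(), e) != vSource.end();
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}

int main()
{
    const std::size_t n = 10000;   // dataset size (step 1)
    const double m = 0.2;          // fraction copied into the source vector (step 2)

    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 1000000);

    std::vector<int> destination(n);
    for (auto& v : destination) v = dist(gen);

    const std::size_t copied = std::max<std::size_t>(1, static_cast<std::size_t>(n * m));
    std::vector<int> source(destination.begin(), destination.begin() + copied);

    const auto start = std::chrono::steady_clock::now();   // step 3
    remove_elements(destination, source);                  // step 4
    const auto stop = std::chrono::steady_clock::now();    // step 5

    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " us, " << destination.size() << " elements left\n";
}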

Only algorithms 3, 5 and 6 were executed on datasets larger than 10 000 elements, as the rest of them took too long for me to comfortably measure (feel free to do it yourself).

Long story short: if your vectors contain fewer than 1000 elements, pick whichever you prefer. If they are longer, the choice depends on the size of vSource. If it is less than 50% of vDestination, choose the set-based algorithm; if it is more, sort the vectors and pick @Revolver_Ocelot's solution (they tie around 60%, with the set-based one being over 2x faster when vSource is 1% of the size of vDestination). Please either don't rely on the order or provide a vector that is sorted from the beginning: the requirement that the ordering has to remain the same slows the process down dramatically. Benchmark on your use case, your compiler, your flags and your hardware. I've attached a link to my benchmarks in case you want to reproduce them.

Complete results (file vector-benchmarks.csv) are available on GitHub together with benchmarking code (file tests/benchmarks/vectorRemoval.cpp) here.

Please keep in mind that these are results that I've obtained on my computer, my compiler etc. - in your case they will differ (especially when it comes to point in which one algorithm is better than another).

I've used GCC 6.1.1 with -O3 on Fedora 24, on top of VirtualBox.

Tomasz Lewowski

It can be written with the STL as:

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    const auto isInSource = [&](int e) {
        return std::find(vSource.begin(), vSource.end(), e) != vSource.end();
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}

If vSource is sorted, you may replace std::find with std::binary_search.
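
A sketch of that variant (same function as above, assuming vSource is kept sorted):

#include <algorithm>
#include <vector>

void remove_elements(std::vector<int>& vDestination, const std::vector<int>& vSource)
{
    const auto isInSource = [&](int e) {
        // vSource must be sorted for std::binary_search to be valid
        return std::binary_search(vSource.begin(), vSource.end(), e);
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}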

Jarod42