
What is the best way to delete elements from a vector, given another vector?

I have come up with the following code:

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    if(!vDestination.empty() && !vSource.empty())
    {
        for(auto i: vSource) {
            vDestination.erase(std::remove(vDestination.begin(), vDestination.end(), i), vDestination.end());
        }
    }
}

int main() 
{
    vector<int> v1={1,2,3};
    vector<int> v2={4,5,6};
    vector<int> v3={1,2,3,4,5,6,7,8,9};
    remove_elements(v3,v1);
    remove_elements(v3,v2);
    for(auto i:v3)
        cout << i << endl;
    return 0;
}

Here the output will be:

7
8
9
user3898160
  • assuming that vectors will be large enough to justify even bothering with the optimizations, I would first transform `vDestination` to a `std::list` (or list of smart pointers?) to avoid costly removal from vector (because its storage has to be contiguous) and back to `std::vector` at the end – slawekwin Aug 25 '16 at 07:44
  • @slawekwin: I strongly doubt that this will be faster for any realistic size of the vector. Lists require additional redirections and cannot be cached efficiently. (I am aware that element removal is O(1) for lists and O(N) for vectors.) – Frank Puffer Aug 25 '16 at 07:54
  • @Frank I am not sure about it as well, but I believe it would be an interesting experiment to measure it :) – slawekwin Aug 25 '16 at 07:57
  • Are all the vectors sorted? – Bob__ Aug 25 '16 at 08:00
  • might be worth checking out if creating an `unordered_set` (from `vSource`) first helps with efficiency. Theoretically it should change the worst case from `N * M` to `N + M` (assuming `N` is `vDestination` size and `M` is `vSource` size - one lookup per `vDestination` element instead of up to `M` lookups + set creation), but it depends on bucket configuration. Plus, you didn't mention what happens with duplicates - should they be removed once, or all of them? – Tomasz Lewowski Aug 25 '16 at 08:29
  • @Bob__: No, the vectors are not sorted; however, in the example they are. – user3898160 Aug 25 '16 at 08:55

4 Answers


My version is the following: I only apply erase once, after all elements from the vector vSource have been moved to the end by std::remove, and I keep track of an iterator to the end of the not-yet-removed range of vDestination so that the already-compacted tail is not iterated over for nothing.

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    auto last = std::end(vDestination);
    std::for_each(std::begin(vSource), std::end(vSource), [&](const int & val) {
        last = std::remove(std::begin(vDestination), last, val);
    });
    vDestination.erase(last, std::end(vDestination));
}

See it on coliru: http://coliru.stacked-crooked.com/a/6e86893babb6759c


Update

Here is a template version, so you don't have to care about the container type:

template <class ContainerA, class ContainerB>
void remove_elements(ContainerA & vDestination, const ContainerB & vSource) 
{
    auto last = std::end(vDestination);
    std::for_each(std::begin(vSource), std::end(vSource), [&](typename ContainerB::const_reference val) {
        last = std::remove(std::begin(vDestination), last, val);
    });
    vDestination.erase(last, std::end(vDestination));
}

Note

This version works for unsorted vectors without any constraints. If your vectors are sorted, you can take some shortcuts and avoid iterating over the destination vector again and again to delete each element.
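
If the vectors are not sorted but vSource is large, another possible shortcut (suggested in the comments on the question) is to build a std::unordered_set from vSource once and do a single std::remove_if pass over vDestination. A minimal sketch of that idea (the function name is illustrative):

#include <algorithm>
#include <unordered_set>
#include <vector>

// Hash-set lookup instead of scanning vSource for every element of vDestination.
// Preserves the order of the kept elements; expected complexity O(N + M).
void remove_elements_hashed(std::vector<int>& vDestination, const std::vector<int>& vSource)
{
    const std::unordered_set<int> toRemove(vSource.begin(), vSource.end());
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(),
                       [&](int e) { return toRemove.count(e) > 0; }),
        vDestination.end());
}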

dkg
  • Or with [this](http://coliru.stacked-crooked.com/a/df3f23f7aa6f7f54) "STL-elegant" psychedelics. – ilotXXI Aug 25 '16 at 08:09
  • @ilotXXI not sure if "remove_if" is necessary here. Considering "remove" does kind of the same for all elements with the same value. – Hayt Aug 25 '16 at 08:20
  • @dkg: Thanks for the answer. How did you check that this is more efficient? – user3898160 Aug 25 '16 at 08:21
  • @Hayt, what do you mean? It iterates different values, not the same. – ilotXXI Aug 25 '16 at 08:38
  • ah you are right. sorry I kind of missed you switched the inner/outer "loop". Would be interesting to see if there are performance changes. I have no access to a compiler at the moment though. I can also see, for "larger" vectors, putting the source vector values into an unordered_set and just doing a lookup to see if they are there, to avoid repeated iteration over the source vector. – Hayt Aug 25 '16 at 08:41

If your vectors are always sorted, you can use set_difference:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>

void remove_elements(std::vector<int>& vDestination, const std::vector<int>& vSource) 
{
    std::vector<int> result;
    std::set_difference(vDestination.begin(), vDestination.end(), vSource.begin(), vSource.end(), std::back_inserter(result));
    vDestination.swap(result);
}

int main() 
{
    std::vector<int> v1={1,2,3};
    std::vector<int> v2={4,5,6};
    std::vector<int> v3={1,2,3,4,5,6,7,8,9};
    remove_elements(v3,v1);
    remove_elements(v3,v2);
    for(auto i:v3)
        std::cout << i << '\n';
}

If not for the requirement that the output range must not overlap with either input range, we could even avoid the additional vector. You could potentially roll your own version of set_difference that is allowed to write to a range starting at vDestination.begin(), but that is outside the scope of this answer.
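
For illustration, a hand-rolled in-place variant along those lines could look like the following sketch (both vectors sorted, same duplicate semantics as std::set_difference; the function name is illustrative):

#include <vector>

// Writes the kept elements directly over the front of vDestination, then erases the tail.
// Both inputs must be sorted; each occurrence in vSource cancels at most one matching
// occurrence in vDestination, just as std::set_difference would.
void remove_elements_inplace(std::vector<int>& vDestination, const std::vector<int>& vSource)
{
    auto out = vDestination.begin();      // next write position for a kept element
    auto src = vSource.begin();
    for (auto it = vDestination.begin(); it != vDestination.end(); ++it) {
        while (src != vSource.end() && *src < *it)
            ++src;                        // skip source values smaller than the current element
        if (src != vSource.end() && !(*it < *src))
            ++src;                        // *it == *src: drop this one occurrence
        else
            *out++ = *it;                 // not matched: keep it
    }
    vDestination.erase(out, vDestination.end());
}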

Revolver_Ocelot

I assume that by best you mean the fastest one that works. Since it's a question about efficiency, I performed a simple benchmark to compare the efficiency of several algorithms. Note that they differ a little, since the problem is a bit underspecified - the questions that arise (and the assumptions taken for the benchmark) are:

  • is it guaranteed that vDestination contains all elements from vSource? (assumption: no)
  • are duplicates allowed in either vDestination or vSource? (assumption: yes, in both)
  • does the order of the elements in the result vector matter? (algorithms for both cases tested)
  • should every element from vDestination be removed if it is equal to any element from vSource, or only one-for-one? (the tested algorithms differ on this point)
  • are the sizes of vDestination and vSource somehow bounded? Is one of them always bigger or much bigger? (several cases tested)
  • in the comments it's already explained that the vectors don't need to be sorted, but I've included this point, as it's not immediately visible from the question (no sorting assumed in either vector)

As you can see, there are a few points in which the algorithms will differ and consequently, as you can guess, the best algorithm will depend on your use case. The compared algorithms include:

  1. the original one (proposed in the question) - baseline
  2. the one proposed in @dkg's answer
  3. the one proposed in @Revolver_Ocelot's answer + additional sorting (required by the algorithm) and pre-reservation of space for the result vector
  4. the one proposed in @Jarod42's answer
  5. a set-based algorithm (presented below - mostly an optimization of @Jarod42's algorithm)
  6. a counting algorithm (presented below)

set-based algorithm:

// Requires <unordered_set> and <utility> (std::swap); 'destination' and 'vSource' are the two vectors.
std::unordered_set<int> elems(vSource.begin(), vSource.end());
auto i = destination.begin();
auto target = destination.end();
// Invariant: [begin, i) holds kept elements, [target, end) holds elements to be erased.
while(i < target) {                      // '<' rather than '<=', so *i is never read past the valid range
    if(elems.count(*i) > 0)
        std::swap(*i, *(--target));      // move the matching element into the tail; element order is not preserved
    else
        ++i;
}
destination.erase(target, destination.end());

counting algorithm:

// Requires <unordered_map>, <algorithm> and <iterator>; 'destination' and 'source' are the two vectors.
std::unordered_map<int, int> counts;
counts.max_load_factor(0.3);
counts.reserve(destination.size());

// Count occurrences in the destination, then subtract one per occurrence in the source
// (duplicates in source cancel duplicates in destination one-for-one).
for(auto v: destination) {
    counts[v]++;
}

for(auto v: source) {
    counts[v]--;
}

// Rewrite the surviving values over the front of the vector; the original element order is not preserved.
auto i = destination.begin();
for(auto k: counts) {
    if(k.second < 1) continue;
    i = std::fill_n(i, k.second, k.first);
}
destination.resize(std::distance(destination.begin(), i));

The benchmarking procedure was executed using the Celero library and was as follows (a plain std::chrono sketch of these steps is shown after the list):

  1. Generate n pseudo-random ints (n in the set {10, 100, 1000, 10000, 20000, 200000}) and put them into a vector
  2. Copy a fraction (m) of these ints to a second vector (fractions from the set {0.01, 0.1, 0.2, 0.4, 0.6, 0.8}, min. 1 element)
  3. Start timer
  4. Execute removal procedure
  5. Stop timer
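
This is not the Celero harness used for the published numbers (that code is in the linked repository); the following is only a minimal std::chrono sketch of steps 1-5, with Jarod42's remove_if version standing in for whichever algorithm is being compared:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// The function under test: any implementation from this page can be dropped in instead.
void remove_elements(std::vector<int>& vDestination, const std::vector<int>& vSource)
{
    const auto isInSource = [&](int e) {
        return std::find(vSource.begin(), vSource.end(), e) != vSource.end();
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}

int main()
{
    const std::size_t n = 10000;   // dataset size (step 1)
    const double m = 0.2;          // fraction copied into the source vector (step 2)

    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 1000000);

    std::vector<int> destination(n);
    for (auto& v : destination) v = dist(gen);

    const std::size_t copied = std::max<std::size_t>(1, static_cast<std::size_t>(n * m));
    std::vector<int> source(destination.begin(), destination.begin() + copied);

    const auto start = std::chrono::steady_clock::now();   // step 3
    remove_elements(destination, source);                  // step 4
    const auto stop = std::chrono::steady_clock::now();    // step 5

    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " us, " << destination.size() << " elements left\n";
}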

Only algorithms 3, 5 and 6 were executed on datasets larger than 10 000 elements, as the rest of them took too long for me to comfortably measure (feel free to do it yourself).

Long story short: if your vectors contain fewer than 1000 elements, pick whichever you prefer. If they are longer, the choice depends on the size of vSource. If it is less than 50% of vDestination, choose the set-based algorithm; if it is more, sort the vectors and pick @Revolver_Ocelot's solution (they tie around 60%, with the set-based one being over 2x faster when vSource is 1% of the size of vDestination). Please either don't rely on the order or provide a vector that is sorted from the beginning: the requirement that the ordering has to remain the same slows the process down dramatically. Benchmark on your use case, your compiler, your flags and your hardware. I've attached a link to my benchmarks in case you want to reproduce them.

Complete results (file vector-benchmarks.csv) are available on GitHub together with benchmarking code (file tests/benchmarks/vectorRemoval.cpp) here.

Please keep in mind that these are results that I've obtained on my computer, my compiler etc. - in your case they will differ (especially when it comes to point in which one algorithm is better than another).

I've used GCC 6.1.1 with -O3 on Fedora 24, on top of VirtualBox.

Tomasz Lewowski

It can be written with the STL as:

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    const auto isInSource = [&](int e) {
        return std::find(vSource.begin(), vSource.end(), e) != vSource.end();
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}

If vSource is sorted, you may replace std::find with std::binary_search.
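
A sketch of that variant (same function as above, assuming vSource is kept sorted):

#include <algorithm>
#include <vector>

void remove_elements(std::vector<int>& vDestination, const std::vector<int>& vSource)
{
    const auto isInSource = [&](int e) {
        // vSource must be sorted for std::binary_search to be valid
        return std::binary_search(vSource.begin(), vSource.end(), e);
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}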

Jarod42