0

Is there an efficient way to compare two vectors doing STL type operations such that I don't have to sort them or copy them? The problem is that sorting causes me to have to create a lock on the getIntersection method and ideally I would like to avoid this as it is really just reading the data structure and finding data inside of it and not altering it. The sort method alters the data structure so other calls of the method need to be synchronized. I probably need to just create a copy but that might be a large copy but may be faster than locking but not sure. Therefore my question becomes whether searching the sorted vector is more efficient than just taking the price of the lock or a copy. Consider the following example:

class X
{


  public:

  struct TestX
  {
     long id;
     .......... // other items
  };


   void getIntersectionByID ( vector<TextX>& result, const vector<TestX>& ids)
   {
      return getItemsByIntersection<long,TestX>( result, _v1, ids, &TestX::id);
      return false; 
   }


   private:
    vector<TestX> _v1;  // assume this is populated with data
};


  // generic pred to do weak ordering on a structure by a generic field
// this is a generalized less than function which can be used for ordering
// and other equality operations
template<typename T, typename K>
struct byField
{
  public:
  byField(T K::* idMember) : idMember_(idMember) {}    

  bool operator() (const K& obj1, const K& obj2)
  {
    return ( obj1.*idMember_ < obj2.*idMember_ );
  }

  private:
  T K::* idMember_;     
};


    template <typename T, typename K>
bool getItemsByIntersection ( std::vector<K>& retds, std::vector<K>& ds, const std::vector<T>& values, T K::* field  )
{
  //create the vector of structs to use for comparison
  typename std::vector<K> searchCriteria(values.size());
  typename std::vector<K>::iterator itS =  searchCriteria.begin();

  // assign the item to the vector
  for (typename std::vector<T>::const_iterator it = values.begin(), itEnd = values.end(); it != itEnd; ++it,++itS)
  {
    (*itS).*field = *it;
  }

  // reserve half the size of the total ds just to be safe
  typename std::vector<K> tmp;
  tmp.reserve(ds.size()/2);

  sort( ds.begin(), ds.end(), byField<T,K>(field) );
  sort( searchCriteria.begin(), searchCriteria.end(), byField<T,K>(field) );

  setGrep ( ds.begin(), ds.end(), searchCriteria.begin(), searchCriteria.end(), std::back_inserter(tmp), byField<T,K>(field) );

 // don't change state until the very end, any existing contents in retds are destroyed
  retds.swap(tmp);

  if ( !retds.empty() )
  {
    return true;
  }

  return false;
}



    /  this is a set grep meaning any items that are in set one
    // will be pulled out if they match anything in set 2 based on operator pred 
    template<typename _InputIterator1, typename _InputIterator2,
      typename _OutputIterator, typename _Compare>
      _OutputIterator
    setGrep(_InputIterator1 __first1, _InputIterator1 __last1,
        _InputIterator2 __first2, _InputIterator2 __last2,
        _OutputIterator __result, _Compare __comp)
    {
      while (__first1 != __last1 && __first2 != __last2)
        if (__comp(*__first1, *__first2))
          ++__first1;
        else if (__comp(*__first2, *__first1))
          ++__first2;
        else
        {
          *__result = *__first1;
          ++__first1;
          ++__result;
        } 
      return __result;
    }
bjackfly
  • 3,236
  • 2
  • 25
  • 38
  • 1
    Any reason you're not using `std::set` directly? Do you need the contiguous memory guarantees or constant time random access of `std::vector'? Using `std::set` amortizes the `sort` complexity over the number of `insert` operations that happen, so you don't need to specifically sort it. – Chad Jun 18 '12 at 14:49
  • Perhaps you can copy the data, then sort the copy (or sort an array of pointers to the original data, if copying it is expensive)? – Jerry Coffin Jun 18 '12 at 14:54
  • [I'm 99% sure you are using C++ standard library, not STL](http://stackoverflow.com/questions/5205491/whats-this-stl-vs-c-standard-library-fight-all-about/5205571#5205571). – Griwes Jun 18 '12 at 15:09

2 Answers2

2

If you have small vectors you can write something that does the trick, but if the vectors are not sorted there's no way to avoid the n*n comparisons. Imagine you have 1,000,000 elements in both vectors, that's 1,000,000,000,000 comparison operations.

If you just need the equal/not equal you can copy both, sort the copies, compare them and destroy the copies...

Doncho Gunchev
  • 2,159
  • 15
  • 21
  • Or if you can afford the copies, put the content of both vectors in a set, and rely on the unicity of the set to check for its size? – piwi Jun 18 '12 at 15:00
  • Thanks for this explanation, yeah.. a sort is definitely necessary – bjackfly Jun 18 '12 at 18:20
1

You could take copies. Either in the obvious way copying as vectors and then sorting, or if the vectors are likely to contain a lot of dupes:

std::set<T,pred> s1(v1.begin(), v1.end());
std::set<T,pred> s2(v2.begin(), v2.end());
std::set_intersection(s1.begin(), s1.end(), s2.begin(), s2.end(), std::back_inserter(tmp), pred());

It might be faster to use unordered_set instead, and also less memory since you only need a "copy" of one of the collections. You'd have to write a hash function, though, which might not be easy depending what your predicate does. You'd also have to write the intersection code, but that's simple.

Other potential options: sort v1 immediately after it is finished being populated; have X use a set instead of a vector; supply the criteria as a set instead of a vector. Whether they're applicable or not depends whether X and/or the caller can see pred. And as above, if you can write a hasher then you can replace set with unordered_set.

Steve Jessop
  • 273,490
  • 39
  • 460
  • 699
  • Sets probably won't work for what I am doing. To be more clear I am not actually using set_intersection but something I called setGrep which is very similar to set intersection except that I want all occurances including dups not just the first one. Also my pred is a finder method that takes in an argument of a pointer to a structure member as the vectors are vectors of structs. Therefore the user can pass in a vector of search structs and the code will grep out from the vector items that match based on a criteria. I will update the original code. – bjackfly Jun 18 '12 at 15:25
  • @bjackfly: Even if the details are different, I think the principle still applies, which is to process a copy. – Steve Jessop Jun 18 '12 at 15:39
  • probably processing a copy is better than a lock? The sort seems necessary after looking at the posts. – bjackfly Jun 18 '12 at 17:03
  • If the sort is always by a field of `T`, and assuming `T` is big enough to warrant it, perhaps you could create one vector per field which contains the indexes of the elements of the vector. Then sort the indexes according to the field. That way you're sharing one copy of the vector between multiple different filters involving the same field. These per-field sorts could perhaps be created as required and cached. If `T` is small enough, you might not need a vector of indexes, multiple copies of the whole thing aren't (much) bigger. – Steve Jessop Jun 18 '12 at 17:18
  • Thanks I could do something like this, but it was to be able to generically search the vector of structs by using any one of the struct items as a key in this way I can keep filtering down the list and keep calling the setIntersection with the filtered list as the next input. The code is working I really just need the first copy of the data member that holds all the data so I will probably just do that and see if the performance is accpetable. Thanks again. – bjackfly Jun 18 '12 at 18:20