The intersection of multiple sorted arrays

Question

From this, we know the method to solve the intersection of two sorted arrays. So how to get the intersection of multiple sorted arrays?

Based on the answers of two sorted arrays, we can apply it to multiple arrays. Here are the codes

vector<int> intersectionVector(vector<vector<int> > vectors){
    int vec_num = vectors.size();

    vector<int> vec_pos(vec_num);// hold the current position for every vector
    vector<int> inter_vec; // collection of intersection elements

    while (true){
        int max_val = INT_MIN;
        for (int index = 0; index < vec_num; ++index){
            // reach the end of one array, return the intersection collection
            if (vec_pos[index] == vectors[index].size()){
                return inter_vec;
            }

            max_val = max(max_val, vectors[index].at(vec_pos[index]));
        }

        bool bsame = true;
        for (int index = 0; index < vec_num; ++index){
            while (vectors[index].at(vec_pos[index]) < max_val){
                vec_pos[index]++; // advance the position of vector, once less than max value
                bsame = false;
            }
        }

        // find same element in all vectors
        if (bsame){
            inter_vec.push_back(vectors[0].at(vec_pos[0]));

            // advance the position of all vectors
            for (int index = 0; index < vec_num; ++index){
                vec_pos[index]++;
            }
        }
    }
}

Is any better approach to solve it?

Update1

From those two topics 1 and 2, it seem that Hash set is more efficient method to do that.

Update2

To improve the performance, maybe the min-heap can be used instead of vec_pos in my codes above. And the variable max_val holds the current max value of all vectors. So just compare the root value with max_val, if they are same, this element can be put into intersection list.

Baum mit Augen · Accepted Answer · 2017-02-10T00:05:10.700

5

To get the intersection of two sorted ranges, std::set_intersection can be used:

std::vector<int> intersection (const std::vector<std::vector<int>> &vecs) {

    auto last_intersection = vecs[0];
    std::vector<int> curr_intersection;

    for (std::size_t i = 1; i < vecs.size(); ++i) {
        std::set_intersection(last_intersection.begin(), last_intersection.end(),
            vecs[i].begin(), vecs[i].end(),
            std::back_inserter(curr_intersection));
        std::swap(last_intersection, curr_intersection);
        curr_intersection.clear();
    }
    return last_intersection;
}

This looks a lot cleaner than your solution which is too confusing to check for correctness. It also has optimal complexity.

The standard library algorithm set_intersection may be implemented in any way that uses

at most 2·(N1+N2-1) comparisons, where N1 = std::distance(first1, last1) and N2 = std::distance(first2, last2).

first1 etc. are the iterators defining the input ranges. You can check out the actual implementation in the source code of your standard-library if it is open source (like libstd++ or libc++).

edited Feb 10 '17 at 00:05

answered Aug 26 '14 at 12:47

Baum mit Augen

49,044
25
144
182

Good answer, @Baum mit Augen, I want to know whether the algorithm used in the function `set_intersection` is same as I mentioned or not? – zangw Aug 26 '14 at 13:45
Yes, The set_intersection() algorithm constructs a sorted intersection of elements from the two ranges. It returns the end of the constructed range. When it finds an element present in both ranges, set_intersection() always copies the element from the first range into result. This means that the result of set_intersection() is guaranteed to be stable. The result of set_intersection() is undefined if the result range overlaps with either of the original ranges. – zangw Aug 26 '14 at 14:28
1

@zangw Do make sure this is efficient enough for your problem. It is going to iterate over each container twice instead of once (even the first, which it copies) which the "most optimal" solution would do. So only a factor of 2 slowdown there. It also allocates the size of the first container, even if we need far less data than that -- on the other hand, it does not allocate anything in the number of containers. – Yakk - Adam Nevraumont Aug 26 '14 at 15:04
`set_intersection` has optimal worst-case complexity, but that's almost never the right thing to look at. Galloping search is a better way to intersect two lists. – tmyklebu Aug 26 '14 at 17:10
1

Did one of you actually *measure* a difference or are you just assuming things? (I don't say my solution is the fastest, but with even Alexandrescou saying that he needs to measure everything, abandoning the easiest solution for performance reasons without actual data always seems like premature optimization to me.) – Baum mit Augen Aug 26 '14 at 17:13
1

@tmyklebu, Is there any example I can refer to? – zangw Aug 27 '14 at 02:43
@zangw: "Galloping search" is described in Cormack, Clarke, and Buettcher's book, for instance. And probably a million other places. It's a standard trick in information retrieval. – tmyklebu Aug 27 '14 at 07:45
1

@BaummitAugen: Yes, I've implemented both the algorithm you've stated and galloping search. Galloping search wins handily on quite a number of workloads because it doesn't inspect every item in the list. My main point is that, while the algorithm you describe is simple and your implementation is elegant, set intersection is a rather special problem in that we can get a linear-worst-case set intersection algorithm with substantially sublinear behaviour on data sets "typical" in many domains. – tmyklebu Aug 27 '14 at 07:46
Here's a ideone test in case someone want to play around with this: http://ideone.com/lFJqyj – Tomáš Zato Mar 14 '17 at 14:40
With some more years of experience, I'd like to walk back my comments from above a bit. :) I can well see other implementations be faster, and significantly so. However, I'll let it stand as a solution that should not be terribly slow and is easily verifiably correct; so if this function is not a main bottleneck of the overall program, it may well be suitable for use. – Baum mit Augen Mar 22 '21 at 10:18

score 2 · Answer 2 · answered Aug 26 '14 at 15:06

This assumes you know the number of containers you are intersecting:

template<class Output, class... Cs>
Output intersect( Output out, Cs const&... cs ) {
  using std::begin; using std::end;
  auto its = std::make_tuple( begin(cs)... );
  const auto ends = std::make_tuple( end(cs)... );
  while( !at_end( its, ends ) ) {
    if ( all_same( its ) ) {
      *out++ = *std::get<0>(its);
      advance_all( its );
    } else {
      advance_least( its );
    }
  }
  return out;
}

To complete simply implement:

bool at_end( std::tuple<Iterators...> const& its, std::tuple<Iterators...> const& ends );
bool all_same( std::tuple<Iterators...> const& its );
void advance_all( std::tuple<Iterators...>& its );
void advance_least( std::tuple<Iterators...>& its );

The first is easy (use indexes trick, compare pairwise, check that you returned true if the tuples are empty).

The second is similar. It should be easier if you compare std::get<i>(its) == std::get<i+1>(its) I think rather than compare all to zero. A special case for empty might be required.

advance_all is even easier.

The last is the tricky one. The requirements are that you advance at least one iterator, and you do not advance the one that dereferences the most, and you advance iterators at most once, and you advance the most you can up to efficiency.

I suppose the easiest method is to find the greatest element, the advance everything less than that by 1.

If you don't know the number of containers you are intersecting, the above can be refactored to use dynamic storage for the iteration. This will look similar to your own solution, except with the details factored out into sub functions.

The intersection of multiple sorted arrays

2 Answers2

Linked

Related