
I want to compare the elements of two vectors filled with integers, to look for elements with the same value (and count them).

So for example, if a[i]==x, is there a b[j]==x ?

The first implementation that came to mind is, of course, the simplest one:

for (int i = 0; i < a.size(); i++) {
  for (int j = 0; j < b.size(); j++) {
    if (a[i] == b[j]) { counter++; }
  }
}

This is way too slow for larger vectors. I have thought of an alternating algorithm, but I'm too unskilled to implement it right, so here is what I've got so far and my problem:

  for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
      if (b[j] >= a[i]) {
        counter++;
        for (int k = i + 1; k < n; k++) {
          if (a[k] >= b[j+1]) {
            counter++;
            for (int l = k + 1; l < m; l++) {
              if (b[l] >= a[k]) {
                counter++;
                ....
              }
            }
          }
        }
      }
    }
  }

The flow is that I start by comparing the first element of a with the elements of b. When I have a hit, I jump to vector b and compare its next element with the elements of a that come after the element of a I was comparing with before. (Since the elements of a and b are stored in ascending order, I switch vectors as soon as the element of b I'm comparing with is greater than or equal to the element of a, hence the >= rather than == in the code.)

This should be easy to do with a function that calls itself, I think, but I can't wrap my head around it.

I hope somebody can understand what I'm trying to do; I can't explain it better at the moment. In theory, I think this should be much faster for vectors in ascending order, since you only have to do about N comparisons (N being the size of the larger vector) instead of N*M.

Fl.pf.
    I think it'd make sense to tackle this algorithmically rather than by trying to optimize a loop. For example, sorting the arrays could lead to considerable optimizations. Brute-force comparing seems like the least elegant way to solve this. How large are these arrays, typically? – tadman Feb 13 '16 at 15:08
  • The size of the vectors is up to around a million – Fl.pf. Feb 13 '16 at 15:10
  • Use an `unordered_map`. – Karoly Horvath Feb 13 '16 at 15:11
  • The most important question here is whether vectors are *sorted* or not. And you only mention this in passing somewhere in the middle of your question. So, once again: when they are given to you, are these two vectors *already sorted* or not? – AnT stands with Russia Feb 13 '16 at 15:14
  • yes they are sorted in ascending order – Fl.pf. Feb 13 '16 at 15:14
  • My take on it would be sorting both vectors, then using something like `memcmp` on the underlying data (that is, only if `std::is_pod::value` is true). – KABoissonneault Feb 13 '16 at 15:16

3 Answers


Since you stated that the elements are stored in order, you could do something like this:

int i = 0;
int j = 0;

while (i < a.size() && j < b.size()) {
  if (a[i] == b[j]) {
    ++counter;
    ++i;
    ++j;
  } else if (a[i] < b[j]) {
    ++i;
  } else {
    ++j;
  }
}
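
As a side note, the same walk can also be written as a function that calls itself, which is what the question hints at. Here is only a rough sketch along those lines (the function name and default arguments are illustrative, not from this answer; note the recursion is roughly N+M calls deep, so very large vectors may overflow the stack):

#include <cstddef>
#include <vector>

// Count one-to-one matches between two ascending vectors by alternating
// between them: on a hit advance both indices, otherwise advance the index
// that points to the smaller value.
int countMatches(const std::vector<int>& a, const std::vector<int>& b,
                 std::size_t i = 0, std::size_t j = 0) {
  if (i >= a.size() || j >= b.size()) return 0;                   // ran off one vector: done
  if (a[i] == b[j]) return 1 + countMatches(a, b, i + 1, j + 1);  // hit: advance both
  if (a[i] < b[j])  return countMatches(a, b, i + 1, j);          // a is behind: advance a
  return countMatches(a, b, i, j + 1);                            // b is behind: advance b
}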
Anedar
  • So easy, yet so effective. Thank you very much, that did the trick! (It's still pretty slow, but works for now ;)) – Fl.pf. Feb 13 '16 at 15:21
  • Yes, but the worst-case complexity is still N*M. Suppose that A contains 1000000 values and B only one, but somewhere near the end. So you have to check all values of A one by one until reaching the last one. Not efficient. – Baj Mile Feb 13 '16 at 21:58
  • @BajMile Sorry, but it's only O(N+M), as AnT stated in his (more elegant but similar) solution. You can see it like this: in every iteration either `i` or `j` is increased by 1, so in the worst case you end up with `i=N` and `j=M` with no match at all (because a match would result in both being increased at the same time), which needs exactly `N+M` increases with `2*(N+M)` comparisons. – Anedar Feb 13 '16 at 22:05
  • Yes, sorry. Maybe the average complexity is proportional to O(N+M), but I think it is still inefficient in some marginal cases. You have a bug in the above code: the first ++i should be removed if you want to count the repetitions in the first buffer too. – Baj Mile Feb 15 '16 at 17:23

This is a variation of the classic problem of merging two sorted vectors.

If the vectors are sorted, then all you need to do is perform an incremental linear search for the sequential elements of vector a in vector b. Each search in b begins where the previous search left off. This approach takes O(a.size() + b.size()) comparisons.

int count = 0;

int j = 0;
for (int i = 0; i < a.size(); ++i)
  for (; j < b.size() && b[j] <= a[i]; ++j)
    if (b[j] == a[i])
      ++count;

If you look closely, you'll see that this is exactly the same algorithm as in @Anedar's answer, just expressed from a "different vantage point".

However, if these two vectors have significantly different length (say, a is much shorter than b) then it might make sense to take sequential elements of a and do a binary search in b. Again, each search in b works "to the right" from the previously found element of b. This approach will take O(a.size() * log b.size()) comparisons.

If a.size() ~ b.size() then O(a.size() + b.size()) is better than O(a.size() * log b.size()). But if a.size() << b.size(), then it's the other way around.
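
A minimal sketch of what this binary-search variant might look like (the function name is mine; std::equal_range is used so the counting matches the linear loop above, and each search is restricted to the part of b that has not been passed yet):

#include <algorithm>
#include <vector>

// For each a[i], binary-search the not-yet-visited tail of b for the run of
// values equal to a[i]; every element of b is skipped over at most once.
int countByBinarySearch(const std::vector<int>& a, const std::vector<int>& b) {
  int count = 0;
  auto from = b.cbegin();                                  // next search starts where the last one ended
  for (int x : a) {
    auto range = std::equal_range(from, b.cend(), x);
    count += static_cast<int>(range.second - range.first); // how many elements of b equal a[i]
    from = range.second;                                   // everything <= a[i] is now behind us
  }
  return count;
}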

AnT stands with Russia
  • Thank you for your in-depth answer (yes, I see it's the same as @Anedar's). The problem is that there is no fixed relation between the sizes. It can be 1:1 in one iteration, 1:1000000 in the next one. So I have to find a middle point. Or use a threshold to choose between binary search and linear search... – Fl.pf. Feb 13 '16 at 15:34
  • @Fl.pf.: In theory, you can "blend" these two approaches as I describe here: http://stackoverflow.com/a/12993740/187690. But in real life a simple threshold might indeed be perfectly sufficient. – AnT stands with Russia Feb 13 '16 at 16:01

If you are not required to use a vector, you can use std::multiset as the container for the values in both sets.

A std::multiset allows inserting duplicate values while keeping the values ordered. You can get the number of occurrences of a value with std::multiset::count.

One check is performed in O(log N); checking whether all values from B are contained in A, including their repetition counts, takes O(M log N) in the worst case. The same can be achieved with std::map, but sets take less memory than maps and are simpler to use.

If a value occurs in set B more than once, only one lookup in set A needs to be made for it.
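
A minimal sketch of this approach might look like the following (the function name is mine; note that std::multiset::count counts every occurrence, so this counts matching pairs, like the brute-force double loop in the question):

#include <set>
#include <vector>

// Build an ordered multiset from a, then look up every value of b in it.
long long countViaMultiset(const std::vector<int>& a, const std::vector<int>& b) {
  std::multiset<int> inA(a.begin(), a.end());          // O(N log N) to build
  long long count = 0;
  for (int x : b)
    count += static_cast<long long>(inA.count(x));     // O(log N + k) per lookup
  return count;
}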

Baj Mile
  • But the OP has *two separate* sets of data. That's given. How do you propose to solve this problem in `O(log N)` with "multisets"? What multisets specifically? One multiset or two multisets? If you simply use two multisets instead of two vectors, then no, you will not be able to solve it in `O(log N)`. – AnT stands with Russia Feb 13 '16 at 15:58
  • O(log N) is one check in A. To check all values from B you need to make M * O(log N) checks. But if B is also a set, M is reduced to the number of unique values. – Baj Mile Feb 13 '16 at 21:53
  • But one check in a sorted vector is also `O(log N)`, which means that checking everything is `O(M * log N)` in sorted vectors. But two sorted sequences can be checked in `O(M + N)`, as shown in other answers, which is better than `O(M * log N)` when `M` and `N` are close. This applies identically to sets and sorted vectors. In fact, there's absolutely no difference between a set and a sorted vector in this context. Which is why it is unclear to me what specifically is the point you are trying to make. – AnT stands with Russia Feb 14 '16 at 17:17