0

I have a question about how to efficiently search two containers in order to find the common items.

For example, I have two lists A and B, and I want to find all items in B that also appear in A.

In this case, I need two loops, one inside the other. That is not good, because for each item of A, I do a whole search through B.

Do you have some ideas, or is there a standard library (Boost is OK) to solve it ;) ?

Thanks a lot!

CJAN.LEE

5 Answers

4

You could std::sort() the containers and then use std::set_intersection() (I'm not entirely sure about the name of this algorithm). The complexity would be O(n ln n + m ln m) rather than O(n * m), with n and m being the sizes of the sequences.

Arne Mertz
Dietmar Kühl
  • [`std::set_intersection`](http://en.cppreference.com/w/cpp/algorithm/set_intersection) is correct. The complexity mentioned is for the two sorts - if the sequences are sorted already, it is `O(n+m)` for the `set_intersection`. – Arne Mertz Nov 06 '13 at 12:37
  • 1
    If my calculations are correct, sorting only *one* container and using `std::binary_search` is better, with a complexity of `O((n+m) ln m)`. See my answer. – Arne Mertz Nov 06 '13 at 13:08
  • @Arne Mertz: Depending on the relative size of the containers it may be better: The complexity is `O((m + n) ln x)` with `x` being `max(m, n)` for sorting both containers and `min(m, n)` for sorting just one. If `m` and `n` are roughly the same size it doesn't matter; if they are dramatically different (orders of magnitude), sorting just one would be better. The next question is which one has the smaller constants for practical problem sizes. – Dietmar Kühl Nov 06 '13 at 19:12
  • Well I guess then it just boils down to the usual thing with performance: profile it ;-) – Arne Mertz Nov 07 '13 at 06:41
4

As you can see from the different answers, there are multiple approaches. Which one is best depends on what your containers are, on whether the ranges are already sorted, on the typical size of the ranges, and on whether sorting the ranges is an option.

  • If both containers are sorted, std::set_intersection is the best way to go; its complexity is O(n+m).
  • Sorting a container of size n has complexity O(n log(n)) in terms of comparisons and swaps. Sorting a list means swapping list nodes, which is cheap. Sorting a vector means actually swapping the elements, and the cost depends on the element type.
  • With one sorted and one unsorted container, it is best to do a std::binary_search for each element of the unsorted range in the sorted range. The complexity of that is O(n log(m)), with n being the size of the unsorted range and m the size of the sorted range. Sorting the unsorted range first and using set_intersection afterwards would have a complexity of O(n log(n) + m), which is worse.
  • With two unsorted containers, it pays off to sort one of them and then apply binary_search for the elements of the other one, giving a complexity of O((m+n) log(m)). So if both containers have the same element type, sorting the smaller container is better.
BartoszKP
Arne Mertz
3

If you have two lists A (size n) and B (size m), then finding every element in B that exists in A is O(nm) using a nested loop.

I'd suggest using a hash set. If you build a hash set from the elements of B, you'll spend O(m) building the set and then O(n) looking up every element of A in it. So the total complexity would be O(n+m).

Jong Bor Lee
0

Maybe you can copy A and B into arrays and sort them first, then count the common elements. It's O(n log(n)), but needs more space.

likebrain
0

If you're looking to optimize the solution, you should share some more information about the problem domain. For example, if you knew that all the items in the lists were integers between 1 and 100, you could use a simple boolean array of 100 flags, and finish the task by running once over A (raising the appropriate flags) and then once over B (testing the flags).

If the lists have arbitrary contents, you have to settle for a generic solution. The naive solution would be to have a double loop like you suggested, which is not necessarily that bad. You can make several practical optimizations:

  1. The outer loop should be done on the shorter list (provided you know their lengths). This means that your inner loop can break as soon as you find your item (if you find it...).
  2. If memory is not an issue, you can sort both lists, and then progress over both of them in parallel, moving forward on one list as long as the other list's items are bigger (like doing a sorted merge between them). This has an order of magnitude of O(NlogN+MlogM+max(N,M)), which is probably better than O(N*M), but also wasteful in terms of memory.
Eldad Mor