
I have a set of N customers, indexed 0,...,N-1. Periodically, for some subset S of customers, I need to evaluate a function f(S). Computing f(S) is of linear complexity in |S|. The set S of customers is represented as an object of type std::vector<int>. The subsets that come up for evaluation can be of different size each time. [Since the order of customers in S does not matter, the set can as well be represented as an object of type std::set<int> or std::unordered_set<int>.]

In the underlying application, the same subset S of customers may come up multiple times for evaluation of f(S). Instead of incurring the needless linear cost each time, I am looking to see whether the application would benefit from some sort of less computationally burdensome lookup.

I am considering having a map of key-value pairs where the key is directly the vector of customers, std::vector<int> S, and the value mapped to this key is f(S). That way, I am hoping that I can first check whether a key already exists in the map, and if it does, look up its value without having to compute f(.) again.
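Something along the lines of the following sketch is what I have in mind. (Here `f_impl` is just a stand-in for my actual O(|S|) routine, and I am assuming the result type is `double`; the vector is sorted and deduplicated so that the same subset always produces the same key.)

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Stand-in for the actual O(|S|) evaluation routine.
double f_impl(const std::vector<int>& S)
{
    double total = 0.0;
    for (int customer : S) total += customer;  // placeholder work
    return total;
}

std::map<std::vector<int>, double> cache;

double f(std::vector<int> S)
{
    // Canonicalize the key: order does not matter, so sort and deduplicate.
    std::sort(S.begin(), S.end());
    S.erase(std::unique(S.begin(), S.end()), S.end());

    auto it = cache.find(S);
    if (it != cache.end())
        return it->second;               // cache hit: skip re-evaluation

    double value = f_impl(S);
    cache.emplace(std::move(S), value);
    return value;
}
```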

Having an std::map with std::vector as keys is well-defined. See, e.g., here.

CPPReference indicates that map lookup time is logarithmic. But I suppose this is logarithmic in the number of keys where each key is of constant length -- such as an int or a double, etc. How is the complexity affected when the key itself need not be of constant length and can be of arbitrary length up to size N?

Since the keys can themselves be of different sizes (the subset of customers that comes up for evaluation could be different each time), does this introduce any additional complexity in computing a hash function or the compare operation for the std::map? Is there any benefit to maintaining the key as a binary array of fixed length N? This binary array is such that B_S[i]=1 if the i-th customer is in set S, and it is 0 otherwise. Does this make the lookup any easier?
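To make the fixed-length idea concrete, this is roughly what I mean (assuming N is known when the key is built; the `double` value type is again just a placeholder):

```cpp
#include <map>
#include <vector>

// Fixed-length binary key: B_S[i] == 1 exactly when customer i is in S.
std::vector<char> make_key(const std::vector<int>& S, int N)
{
    std::vector<char> B_S(N, 0);
    for (int customer : S)
        B_S[customer] = 1;
    return B_S;
}

// Every key has exactly N entries, so each comparison walks at most
// N chars regardless of |S|.
std::map<std::vector<char>, double> cache_by_mask;
```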

I am aware that ultimately the design choice between reevaluating f(S) each time versus using std::map would have to be made based on actual profiling of my application. However, before implementing both ideas (the std::map route is more difficult to code in my underlying application), I would like to know whether there are any pre-existing best practices / benchmarks.

Tryer
  • Maybe you should use a wrapper class which stores the vector and the value of f for that vector; you would modify the vector via methods of the wrapper and recalculate f on modification of the vector. This way you can avoid recalculations without using a map – Andrew Kashpur Jan 06 '22 at 08:38
  • Just a remark: a vector is not a set. Order matters and a vector can contain duplicates. Unsure whether it is relevant here, but what is the reason for not using a `std::set` (or `std::unordered_set`)? – Serge Ballesta Jan 06 '22 at 08:45
  • @SergeBallesta Indeed you are correct. In the underlying application, the order need not matter. I might as well go with `std::set` or `std::unordered_set` if that is going to make computations and lookups easier. I will edit the question accordingly. – Tryer Jan 06 '22 at 08:47
  • I would use `std::unordered_map` instead; you don't have to compare full vectors/sets each time, you can just compare hashes when doing a lookup – coyotte508 Jan 06 '22 at 09:06
  • @coyotte508 Yes, I will consider that. https://en.cppreference.com/w/cpp/container/unordered_map indicates that lookup times are average constant time. – Tryer Jan 06 '22 at 09:11
  • even if benchmarks exist you need to measure for your specific application. You are basically asking if it is worth using memoization, and that depends on the cost of `f` vs lookup in the map; both depend on input size, `f` and the comparison of keys. Btw you don't need to store the whole subset as keys in the map. Subsets of a set of N elements can be enumerated with an N-bit integer (or bitset) which makes the lookup rather efficient – 463035818_is_not_an_ai Jan 06 '22 at 09:28
  • @Tryer: How many customers might you have? How big are the sets of customers? These are important factors in the decision about how to implement a memoization table. – rici Jan 06 '22 at 15:36
  • Another consideration: you say that `f(S)` takes `O(|S|)`, but key comparison also takes `O(|S|)` and you must do at least one full key comparison for each lookup. I suppose that the constant factor for `f(S)` is much larger than the constant factor for comparison, but that needs to be factored in as well. If `f(S)` takes a thousand times as long as comparing `S == T`, then memoization might make sense. If it takes ten times as long, the advantage is rather more dubious. And yet another consideration: what fraction of computations can be avoided by memoization? – rici Jan 06 '22 at 15:41
  • @rici. I am looking at around 200 for `N`, and the size of `S` can be between 10 and 25. Computing `f(S)` is a heuristic algorithm to obtain a bound of some sort on the worthiness of this customer set. I already did some profiling of my application before, where I was re-calculating `f(S)` for each set that comes up, and this was indeed the hotspot. Hence my search for `std::map` (or others of that nature) type memoization. – Tryer Jan 06 '22 at 15:52
  • @Tryer: Ok. That's not many customers but it's still possibly enough to make a bitvector non-optimal. Suppose you represent a set of customers with a sorted sequence of `uint8_t` (possibly indexes into a vector of IDs). That's certainly adequate for 200 customers. Then the key length for a set of 10 to 25 of them would range between 80 and 200 bits, whereas the bitvector would always be 200 bits. That comparison doesn't include the overhead of `std::vector`, which is both significant and pointless since keys are immutable; it assumes you find a more compact representation. – rici Jan 06 '22 at 16:01
  • The storage cost of the keys could be significant if the table is large, but I was really thinking about the cost of comparison, which is related to the length of the representation. If you use `std::map`, then you also need to consider how far into the representation the first difference occurs. If you use `std::unordered_map`, then you need to consider the cost of hashing (linear in the representation length) as well as the comparison cost. – rici Jan 06 '22 at 16:04

1 Answer


Complexity of lookup in a map is O(log N). That is, roughly log N comparisons are needed when there are N elements in the map. The cost of the comparison itself, which is linear in the key length, multiplies that. For example when you compare M vectors with K elements, then there are roughly log N comparisons per lookup, each comparing up to K vector elements, i.e. in total O(M*K*log N) across the M lookups.

However, asymptotic complexity is only that: asymptotic complexity. When there are only a few elements in the map, lower-order terms might outweigh the log N, which only dominates for large N. Consequently, the actual runtime depends on your specific application and you need to measure to be sure.

Moreover, you shouldn't use vectors as keys in the first place. It's a waste of memory. Subsets of S can be enumerated with an n-bit integer when S has n elements (simply set the i-th bit when the i-th element of S is in the subset). Comparing a single integer (or bitset) is surely more efficient than comparing vectors of integers.
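For illustration, a sketch of that idea. `MAX_CUSTOMERS` is an assumed compile-time upper bound on N and the cached `double` is a placeholder for whatever f returns; `std::hash` has a standard specialization for `std::bitset`, so the key works directly in an `std::unordered_map`:

```cpp
#include <bitset>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Assumed compile-time upper bound on the number of customers N.
constexpr std::size_t MAX_CUSTOMERS = 256;

using SubsetKey = std::bitset<MAX_CUSTOMERS>;

// std::hash<std::bitset<MAX_CUSTOMERS>> is provided by the standard
// library, so the bitset can be used as an unordered_map key as-is.
std::unordered_map<SubsetKey, double> cache;

SubsetKey to_key(const std::vector<int>& subset)
{
    SubsetKey key;                 // all bits start at 0
    for (int customer : subset)
        key.set(customer);         // set bit i when customer i is in the subset
    return key;
}
```

With `std::map` you would need a custom comparator instead, since `std::bitset` does not define `operator<`.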

463035818_is_not_an_ai
  • The `N` in my OP need not be a const. As a result, I may be forced to use an `std::vector` instead of `std::bitset`. Is there a way an optimizing compiler can work as efficiently with the former as with the latter? – Tryer Jan 06 '22 at 14:08
  • I don't understand your statement "when you compare M vectors with K elements, then there are roughly log N comparisons". In the OP, `N` is the number of possible values of the elements of the vector, and it doesn't figure into the complexity if we can assume that it fits into a word. I presume that you are using `M` as the total occupancy of the table (the number of keys) and `K` as something like the average length of a single key. In that case, the lookup time is `O(K * log(M))`, but the average lookup time might be a lot less, depending on the distribution of keys. – rici Jan 06 '22 at 15:30