
I am trying to solve the following problem: Numbers are being inserted into a container. Each time a number is inserted I need to know how many elements are in the container that are greater than or equal to the current number being inserted. I believe both operations can be done in logarithmic complexity.

My question: Are there standard containers in a C++ library that can solve the problem? I know that std::multiset can insert elements in logarithmic time, but how can you query it? Or should I implement a data structure (e.g. a binary search tree) myself to solve it?

Alex Brooks
Ashot
  • I think a red-black tree in which every node stores the size of its subtree would work. But there may be an easier way. – aschepler Jul 02 '13 at 15:03
  • Perhaps a variant of an insertion sort, which counts as it goes. – lurker Jul 02 '13 at 15:03
  • why not use [`std::set.lower_bound`](http://en.cppreference.com/w/cpp/container/set/lower_bound)? – Vlad Jul 02 '13 at 15:04
  • Perhaps just `s.lower_bound(value) - s.begin()`? Not sure. **Update**: no, won't compile :( – Vlad Jul 02 '13 at 15:06
  • My recommendation, if you need to roll your own implementation and don't care about a somewhat larger constant factor, is to use a skip list. Much easier to write. – zw324 Jul 02 '13 at 15:08
  • @Vlad this iterator is bidirectional and does not support subtraction. – ondrejdee Jul 02 '13 at 15:11
  • @Vlad For `set` iterators, `std::distance` is O(N). – aschepler Jul 02 '13 at 15:11
  • @Vlad: only by storing the size of subtrees in each node, and the standard doesn't require that. – Steve Jessop Jul 02 '13 at 15:15
  • @Steve: you're right, I retract my comment. – Vlad Jul 02 '13 at 15:16
  • @Steve: ... but AVL is not required by the standard, and AFAIK red-black trees are used in (at least) gcc. – Vlad Jul 02 '13 at 15:19
  • @SteveJessop: How does knowing a subtree height help? – aschepler Jul 02 '13 at 15:20
  • @aschepler: you're right, sorry, I've confused knowing the subtree height with knowing the size. It's the size you need, so I no longer think there's any special benefit from an AVL tree. – Steve Jessop Jul 02 '13 at 15:23
  • I do not think there is anything in the STL which would suit your needs (provided you MUST have logarithmic times). I think the best solution then, as @aschepler says, is to implement an RB tree. You may have a look at the STL source code, particularly at `stl_tree.h`, to see whether you could use it as a template (I mean an informal template, not a C++ template :) – ondrejdee Jul 02 '13 at 15:31
  • It's not a solution as asked, but maintaining a sorted array/vector with mergesort could work since mergesort works great with mostly-sorted containers. Inserting would be O(1) (push at end), sort would be O(nlogn) (mergesort worst case) and calculating numbers greater than the inserted would be O(logn) (size - binary search to index) – Chris Cooper Jul 02 '13 at 15:34
  • @ChrisCooper Well the O(n) distance on a set beats the O(n.log(n)) sort anytime... you will need to keep the vector sorted and ensure your own log(n) insertion, but RandomAccess iterators for constant distance computation has a price... vector reallocation. – lip Jul 02 '13 at 16:23

5 Answers


Great question. I do not think there is anything in the STL which would suit your needs (provided you MUST have logarithmic times). I think the best solution then, as aschepler says in the comments, is to implement an RB tree. You may have a look at the STL source code, particularly at stl_tree.h, to see whether you could use bits of it.

Better still, have a look at this question: Rank Tree in C++, which contains a link to an implementation:

http://code.google.com/p/options/downloads/list

ondrejdee
  • honestly if you need such an implementation, your n must be huge, as well as the data size needed for comparison (or you would use a vector and your own insertion). If n is huge and the data too, this is more a database problem than a C++ problem... – lip Jul 02 '13 at 16:26

You should use a multiset for logarithmic complexity, yes. But computing the distance is the problem: since set/map iterators are Bidirectional, not RandomAccess, std::distance has O(n) complexity on them:

multiset<int> my_set;
...
auto it = my_set.lower_bound(3);
size_t count_geq = distance(it, my_set.end()); // this is definitely O(n)
my_set.insert(3);

Your complexity requirement is tricky. Here is a full analysis:

If you want O(log(n)) complexity for each insertion, you need a sorted structure such as a set. If you want the structure not to reallocate or move items when adding a new item, computing the distance to the insertion point will be O(n). If you know the number of insertions in advance, you do not need logarithmic insertion time in a sorted container: you can insert all the items and then sort, which is as much O(n.log(n)) as n * O(log(n)) insertions into a set. The only alternative is a dedicated container such as a weighted RB-tree. Depending on your problem this may be the solution, or something really overkill.

  • Use multiset and distance: you are O(n.log(n)) on insertion (yes, n insertions * log(n) insertion time for each one of them), and O(n.n) in total on the distance computations, but each individual distance computation is simple and fast.
  • If you know the inserted data size (n) in advance: use a vector, fill it, sort it, then answer each query with a binary search. You are O(n.log(n)) overall, and it is easy to code.
  • If you do not know n in advance, your n is likely huge and each item memory-heavy, so you cannot afford the O(n.log(n)) reallocation: then it is worth the time to re-encode or re-use some non-standard code. If you really have to meet these complexity expectations, use a dedicated container. Also consider using a database, as you will probably have issues keeping all of this in memory.
lip
  • you should use `upper_bound` instead of `lower_bound`. `lower_bound` + `distance` would give you the number of elements greater than or equal. The question was asked to give the number of elements greater than the search value. – Nathan Ernst Jul 02 '13 at 16:25
  • @NathanErnst: the question says, "I need to know how many elements are in the container that are greater than **or equal to** the current number being inserted" (my emphasis). – Steve Jessop Jul 02 '13 at 16:27
  • @lip: I think some of your analyses are off. Inserting into an ordered vector isn't expected `O(log n)` per insertion, it's only expected `O(n)` per insertion. However, reallocating is average `O(1)`, not average `O(log n)`. The reason is that there are `O(log n)` allocations required when doing `n` insertions, but their sizes are in a geometric progression, whose sum is `O(n)`. This bound is guaranteed. – Steve Jessop Jul 02 '13 at 16:32
  • Thx @SteveJessop you're right, I'll remove the vector example altogether, it just does not match other ideas – lip Jul 02 '13 at 17:11
  • @SteveJessop, the author contradicts themselves between the subject and the actual question. I was referring to the subject. – Nathan Ernst Jul 03 '13 at 01:12

Here's a quick way using Policy-Based Data Structures in C++:

There exists something called an Ordered Set, which lets you insert/remove elements in O(log N) time (and gives pretty much all other functions that std::set has to offer). It also gives two more features: find the Kth element, and find the rank of an element X. The problem is that this doesn't allow duplicates :(

No worries though! We will tag each duplicate with a separate index/priority, and define a new structure (call it an Ordered Multiset)! I've attached my implementation below for reference.

Finally, every time you want to find the number of elements greater than some x, call the function upper_bound (the number of elements less than or equal to x) and subtract that count from the size of your Ordered Multiset!

Note: PBDS trees use a lot of memory; if that is a constraint, I'd suggest a hand-rolled Binary Search Tree or a Fenwick Tree instead.

#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
using namespace std;
using namespace __gnu_pbds;

struct ordered_multiset { // multiset supporting duplicate values
    using ll = long long;            // keys are value*ADD + copy index,
    const ll ADD = 1000010;          // so they need 64 bits to avoid overflow
    const ll MAXVAL = 1000000010;    // shift so stored values are non-negative
    int len = 0;
    unordered_map<ll, ll> mp;        // value -> number of copies inserted
    tree<ll, null_type, less<ll>, rb_tree_tag,
         tree_order_statistics_node_update> T;

    void insert(ll x) {
        len++, x += MAXVAL;
        ll c = mp[x]++;
        T.insert(x * ADD + c);       // each copy gets a distinct key
    }

    void erase(ll x) {
        x += MAXVAL;
        ll c = mp[x];
        if (c) {
            c--, mp[x]--, len--;
            T.erase(x * ADD + c);
        }
    }

    ll kth(int k) {                  // 1-based index: k'th smallest element,
        if (k < 1 || k > len) return -1; // -1 if none exists
        auto it = T.find_by_order(k - 1);
        return *it / ADD - MAXVAL;
    }

    ll lower_bound(ll x) {           // count of values < x
        x += MAXVAL;
        return T.order_of_key(x * ADD);
    }

    ll upper_bound(ll x) {           // count of values <= x
        x += MAXVAL;                 // (assumes < ADD copies of any one value)
        return T.order_of_key(x * ADD + ADD);
    }

    int size() const { return len; } // number of elements in the tree
};

Usage:

    ordered_multiset s;
    for(int i=0; i<n; i++) {
        int x; cin>>x;
        s.insert(x);
        int ctr = s.size() - s.upper_bound(x);
        cout<<ctr<<" ";
    }

Input (n = 5) : 10 1 3 3 2
Output : 0 1 1 1 3

Time Complexity : O(log n) per query/insert

References : mochow13's GitHub

relaxxpls

Sounds like a case for count_if, although I admit this doesn't solve it at logarithmic complexity; that would require keeping the elements sorted.

vector<int> v = { 1, 2, 3, 4, 5 };
int some_value = 3;

// count_if (from <algorithm>) scans linearly; ">=" matches the question's
// "greater than or equal to" requirement
int count = count_if(v.begin(), v.end(), [some_value](int n) { return n >= some_value; });

Edit done to fix syntactic problems with lambda function

Mats Petersson

If the whole range of numbers is sufficiently small (on the order of a few million), this problem can be solved relatively easily using a Fenwick tree.

Although Fenwick trees are not part of the STL, they are both very easy to implement and time efficient. The time complexity is O(log N) for both updates and queries and the constant factors are low.

You mention in a comment on another question, that you needed this for a contest. Fenwick trees are very popular tools in competitive programming and are often useful.

Jørgen Fogh