Efficiently find an integer not in a set of size 40, 400, or 4000

Question

Related to the classic problem find an integer not among four billion given ones but not exactly the same.

To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int in the range of [INT_MIN, INT_MAX].

Now given a std::vector<int> (no duplicates) or std::unordered_set<int>, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?

If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX.

I am more in favor of simple, non-random approaches. Is there any?

Thank you!

Update: to clear up ambiguity, let's say an unsorted std::vector<int> which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN and INT_MAX.

are your given input integers unique or can there be duplicates? — Walter, Dec 18 '18 at 14:40
not exactly the same but how different is your problem to the one you quoted? I see you can apply the same algorithm to find your number. — adrtam, Dec 18 '18 at 14:41
sorting the vector can be done in `O(n log(n))`, not sure if you can get anything more efficient — 463035818_is_not_an_ai, Dec 18 '18 at 14:41
Is the vector sorted? For the set it is trivial since it is sorted. — NathanOliver, Dec 18 '18 at 14:41
@user463035818 if you can sort a vector in `O(log(n))`, you should become very rich indeed. — Walter, Dec 18 '18 at 14:43
@AdrianTam Because size is smaller, so maybe there is simpler one? — aafulei, Dec 18 '18 at 14:43
@NathanOliver could you please demonstrate the trivial solution? — aafulei, Dec 18 '18 at 14:44
@fleix If it has to be inside the range then: Get the last element, then get the one before it. If there is a gap between the numbers then pick one of the numbers in that gap. If not move back another element. Do this until you reach a gap. At worst it is O(N) to find there isn't a solution. — NathanOliver, Dec 18 '18 at 14:47
Otherwise JVApen's comment should work unless `last == INT_MAX`. Then you need to work backwards. — NathanOliver, Dec 18 '18 at 14:48
The sorting approach is fairly optimal. I'm pretty sure you could get better average running time with a randomized approach. With a maximum of only 4k integers, a randomly generated integer has about a six percent chance of being one of those 4k. With a `std::set` already in place lookups only cost O(log n). If you want, you can sample a distribution rather than uniformly randomly to guarantee that it will *eventually* terminate. — AndyG, Dec 18 '18 at 14:51
@AndyG thanks for the observation which I agree with. Just that this ought to be a small part of my code and I want to kill it with 5 lines. Feel a little cumbersome with setting up random engine/seed. But if every one says it's good, I'll take it. — aafulei, Dec 18 '18 at 14:59
@fleix: Is the `std::vector` also guaranteed to be sorted already like the `std::set`? — AndyG, Dec 18 '18 at 15:06
@AndyG let's say it's unsorted, as I just put in the update. — aafulei, Dec 18 '18 at 15:08
How many numbers do you want to generate, just a single one? — Bergi, Dec 19 '18 at 12:40
By the latest update this is probably not relevant anymore but I think it could be possibly done in O(lg N) time and O(1) extra space if the data is given in form of **sorted** array. I may try to explain my idea if anyone's interested. — Display Name, Dec 19 '18 at 13:54

Walter · Accepted Answer · 2018-12-18T15:34:23.797

31

You could just return the first of N+1 candidate integers not contained in your input. The simplest candidates are the numbers 0 to N. This requires O(N) space and time.

 int find_not_contained(container<int> const&data)
 {
     const int N=data.size();
     std::vector<char> known(N+1, 0);   // one more candidates than data
     for(int i=0; i< N; ++i)
         if(data[i]>=0 && data[i]<=N)
             known[data[i]]=1;
     for(int i=0; i<=N; ++i)
         if(!known[i])
             return i;
     assert(false);                     // should never be reached.
 }

Random methods can be more space efficient, but may require more passes over the data in the worst case.

edited Dec 18 '18 at 15:34

answered Dec 18 '18 at 15:15

Walter

44,150
20
113
196

2

I like this. It avoids sorting and searching. – aafulei Dec 18 '18 at 15:50
4

This is the standard algorithm for finding the smallest integer not in a given set – DreamConspiracy Dec 18 '18 at 22:20
@DreamConspiracy That makes sense. I didn't know that, though. – Walter Dec 19 '18 at 00:39
2

Fun fact: on a machine with scatter stores, like x86 with AVX512F, this algorithm can be vectorized. (At least with x86-style scatters where a mask controls which elements are actually scattered). And unlike a histogram problem where you have to do conflict detection for multiple vector elements hitting the same index, you're only ever storing a `1` so it's fine. (x86 can only scatter with 32 or 64-bit granularity, so you'd have to promote `known` to `int`, which may not be worth it for large `N`. With large enough `N`, you'd want to use a bitmap instead of a char array.) – Peter Cordes Dec 19 '18 at 03:59
The search at the end can be vectorized easily, using whatever tricks are applicable for a hand-optimized `strlen`. (e.g. modern x86 should be able to check 16 to 64 bytes per clock cycle, with a well-tuned loop using SSE2 or AVX2, depending on the hardware.) – Peter Cordes Dec 19 '18 at 04:03

Damien · Answer 2 · 2018-12-18T15:30:39.647

Random methods are indeed very efficient here.

If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1 (or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.

For each i in the range, we just set x[array[i] modulo m] = 1.

Then a simple O(n) search in x will provide a value which is not in array

Note: the modulo operation is not exactly the "%" operation

Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple & operation

score 6 · Answer 3 · edited Jun 20 '20 at 09:12

You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)

size_t smallest_unused(std::vector<unsigned>& data) {
  size_t N = data.size(), scan = 0;
  while (scan < N) {
    auto other = data[scan];
    if (other < scan && data[other] != other) {
      data[scan] = data[other];
      data[other] = other;
    }
    else
      ++scan;
  }
  for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
  return scan;
}

The first pass guarantees that if some k in the range [0, N) was found after position k, then it is now present at position k. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.

That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.

To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N swaps can be performed, and the if condition at line 5 will be true at most N times. On the other hand, if the if condition is false, scan will be incremented, which can also only happen N times. So the if statement is evaluated at most 2N times (which is O(N)).

Notes:

I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from [INT_MIN, 0) onto unsigned integers [INT_MAX, INT_MAX - INT_MIN) (The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.

To handle OP's requirement smallest positive unused is sufficient, so an even easier way to handle signed integers is to not swap if the number is negative. — Taemyr, Dec 19 '18 at 14:13
@taemyr: that's what happens if you use the "2's complement" mapping of negative integers, which doesn't require an extra test (assuming C in which the cast is a no-op). In general, if you want to find the smallest available number in some range, you can ignore any entries outside of that range and subtracting the beginning of the range from useful entries. I was going to add that observation but I was mostly interested in the elegance of the original algorithm. — rici, Dec 19 '18 at 14:18

score 4 · Answer 4 · answered Dec 18 '18 at 14:52

4

Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).

answered Dec 18 '18 at 14:52

Alexey Birukov

1,565
15
22

2

He is looking for an efficient way to do it, this would be O(n^2). Yes, it will be way better most times, but still very inefficient worst-case (and OP specifically asked for **O**, that's for worst case) – dquijada Dec 18 '18 at 14:54
@dquijada "*that's for worst case*" No, it isn't. – Acorn Dec 18 '18 at 14:58
@Acorn yes it is. Big-O gives the upper bound – dquijada Dec 18 '18 at 15:06
1

@dquijada No, it isn't. Big O notation describes the asymptotic behavior of a function, but it does *not* mean it is the worst-case of anything. In other words, an algorithm may have different upper bounds depending on the input you are studying. – Acorn Dec 18 '18 at 15:15
2

@Acorn Typically with big-O notation you describe the worst-case time complexity of an algorithm, i.e. the worst time it can have for **all** inputs. Sure, a subset of the inputs may run in with a smaller complexity but worst-case means considering all possibilities. Big-O can also be used to describe the *average*-case complexity. However, as far as I know, the "default meaning" for the sentence "This algorithm takes O(f(n))" time is that O(f(n)) is the **worst-case** scenario, not the average case. – Bakuriu Dec 18 '18 at 23:31
1

@Bakuriu Big O notation has nothing to do with the type of complexity analyzed. It may not even be time complexity. While you can, of course, argue that one may *assume* we are talking about worst-case time complexity, dquijada said that "*OP specifically asked for O, that's for worst case*", which is simply not true. OP didn't *specifically* say anything about worst-case (quite the opposite: we can only assume), nor "O" means worst-case either (that is not true at all). – Acorn Dec 18 '18 at 23:35
3

Further, in this specific problem, this solution is actually very efficient in practice, because you almost never hit the worst case (and/or you can simply fallback to another solution if you detect you may be hitting it, e.g. after a few tries; giving you back a worst-case O(n) algorithm easily). So dismissing this solution by saying it is O(n^2) and claiming that OP wanted a good worst-case algorithm, is simply wrong. – Acorn Dec 18 '18 at 23:42
@Acorn: you can specify a big-O bound for the average or best case if you want, but in this case we're talking about the complexity class of the algorithm over any/all possible input data sets, including hostile inputs that happen to contain all the integers from `x` to `x+N`, where `x` is the random number that the algorithm generates. If an algorithm can be worse than O(n) for some inputs, it's not a member of the set of O(n) algorithms. In practice this is a very good algorithm, especially if you replace `x++` with a simple PRNG step (e.g. linear congruential generator). – Peter Cordes Dec 19 '18 at 04:08
@PeterCordes You may argue that the question talked about adversary cases, so OP may only want to see good worst-case algorithms, fine with me. You may also argue that big O notation, while informally speaking, may refer to worst-case time complexity when no further clarification/context is provided, fine with me too. Yet, however you want to put it, big O notation does not imply worst-case time complexity. – Acorn Dec 19 '18 at 08:37

dquijada · Answer 5 · 2018-12-18T15:14:27.947

2

Step 1: Sort the vector.

That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.

Step 2: Find the first int not in the vector.

Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:

Pseudocode:

SIZE = 40|400|4000 // The one you are using
for (int i = 0; i < SIZE; i++) {
    if (array[i] != INT_MIN + i)
        return INT_MIN + i;

The solution would be O(n log(n) + n) meaning: O(n log(n))

Edit: just read your edit asking for something better than O(n log(n)), sorry.

edited Dec 18 '18 at 15:14

answered Dec 18 '18 at 14:52

dquijada

1,697
3
14
19

Thanks for the answer. A minor point: though it being pseudocode, it doesn't seem `array[i]` would make any sense when `i` starts from `INT_MIN`. – aafulei Dec 18 '18 at 15:10
1

You can do the search using binary search in O(log N). – rici Dec 18 '18 at 20:25
1

@rici: how would you binary search for "a missing number"? If you knew a number that was missing, you could find the position it belongs in a sorted array in log(n) time. I don't think you can do better than linear, although probably with a low constant factor. With a uniform distribution, a linear search will hit a gap in `O(n / max_n)` time on average, but worst case `n` (no gaps until the end.) – Peter Cordes Dec 18 '18 at 23:50
If you sort with Selection Sort, you can check for gaps on the fly, and with a high probability will find a non-duplicate in the first or 2nd pass. (Especially if you check for a random guess candidate as well.) For small sizes, O(n) notation is less meaningful / useful. – Peter Cordes Dec 19 '18 at 00:08
2

@peter: the same way you do a linear search, compare `i + min` with `vec[i]`. If they are unequal, there is a missing value before `i`, so you bisect down; otherwise you bisect up. Binary search works with any monotonic predicate. – rici Dec 19 '18 at 02:05
@rici: ah right, that makes sense. But if your data isn't sorted in the first place, I think you'd still be better off searching on the fly *during* sorting, with a sorting algo that partitions or bins like QuickSort or RadixSort, so you can potentially find a gap in a subset of the whole array before you finish sorting the whole thing, and be sure that the missing value isn't present anywhere else. Or for small arrays, maybe SelectionSort looking for min and max simultaneously, so you normally win very quickly. You don't even need to put the sorted values anywhere, just low/hi counters. – Peter Cordes Dec 19 '18 at 03:37
@PeterCordes: Sure, although I'm pretty fond of the O(N) solution I presented as an answer (assuming you are allowed to rearrange the data). It's faster than sorting, and since you can't definitely identify the first gap until you've seen all the data, it's asymptotically optimal. But I think "binary search on a monotonic predicate" is good to keep in mind; here, in case the data was sorted; and in general because the paradigm applies with surprising frequency and if you only think of binary search when you're looking for a definite element, you can easily miss simple O(log N) solutions. – rici Dec 19 '18 at 04:43
@rici: yup, I hadn't read your answer until after my last comment. Looks good, clearly better than sorting, and can even be implemented efficiently with SIMD for the no-swapping fast case, checking a whole SIMD vector of 4 elements to see if any of them have `data[scan] < scan`, and only breaking out of a vector loop if we find one where we need to check `data[other]` and maybe swap. (At least on x86 where you can efficiently branch on a SIMD compare result, unlike ARM NEON.) – Peter Cordes Dec 19 '18 at 05:22

JFMR · Answer 6 · 2018-12-19T09:50:33.663

For the case in which the integers are provided in an std::unordered_set<int> (as opposed to a std::vector<int>), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>. Searching for the presence of an integer in an std::unordered_set<int> is quite straightforward, since std::unodered_set does provide searching through its find() member function.

The space complexity of this approach would be O(1).

If you start traversing at the lowest possible value for an int (i.e., std::numeric_limits<int>::min()), you will obtain the lowest int not contained in the std::unordered_set<int>:

int find_lowest_not_contained(const std::unordered_set<int>& set) {
   for (auto i = std::numeric_limits<int>::min(); ; ++i) {
      auto it = set.find(i); // search in set
      if (it == set.end()) // integer not in set?
         return *it;
   }
}

Analogously, if you start traversing at the greatest possible value for an int (i.e., std::numeric_limits<int>::max()), you will obtain the lowest int not contained in the std::unordered_set<int>:

int find_greatest_not_contained(const std::unordered_set<int>& set) {
   for (auto i = std::numeric_limits<int>::max(); ; --i) {
      auto it = set.find(i); // search in set
      if (it == set.end()) // integer not in set?
         return *it;
   }
}

Assuming that the ints are uniformly mapped by the hash function into the unordered_set<int>'s buckets, a search operation on the unordered_set<int> can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int> (i.e., in your case M <= 4000).

Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set, is guaranteed to come up against an integer value which is not present in the unordered_set<int>.

Efficiently find an integer not in a set of size 40, 400, or 4000

6 Answers6

Notes: