12

Please read this question carefully before closing it as a duplicate, though if it is an honest duplicate I'll be happy to know about it. This is a generalization of Find any one of multiple possible repeated integers in a list.

Given any set S of N distinct integers, and any array A of length N+1 each entry of which is taken from S, what is the best algorithm to find some (there must be at least one) repeated entry of A?

NOTE: There could be multiple repeated entries in A, and any entry could be repeated multiple times.

As Nemo points out, the trivial solution takes O(1) space and O(N^2) time. I'm looking for a solution that improves the time without compromising the space too much. To be precise, the solution(s) I'm looking for:

  • Returns a value a that appears in A at least twice,
  • Uses at most O(log N) space without modifying A, and
  • Takes less than O(N^2) time

EDIT: The set S is there to ensure that the array A has at least one repeated entry. For this problem, do not assume that you have S given to you as an ordered set. You can query S (boolean to return true is s in S and false otherwise) and you can query A (call for A[i]), but that's all. Any solution that sorts A or S exceeds the space limit.

This generalization invalidates my pointer solution to the original question (which has O(1) space and O(N) time), and the space constraint I'm imposing invalidates fiver's solution (which has O(N) space and time).

Community
  • 1
  • 1
PengOne
  • 48,188
  • 17
  • 130
  • 149
  • 1
    Traverse A putting items in a hash table. If the item exists in the hash table you know it's a duplicate. Traversing A is O(N), looking in the hash is O(logN). So time complexity is nlogn. Space is N for the hashtable. My assumption is your hash table is backed by a balanced tree which requires logN to look the item up. The hash key is the item and the value is the number of occurrences. – Ryan Reeves Jun 22 '11 at 00:20
  • @Ryan: This reads more like an answer than a comment. The hash table takes up O(N) space, does it not? I want better than O(log N) for space. – PengOne Jun 22 '11 at 00:21
  • I'm sorry I misread your post. I read O(n^2) space. – Ryan Reeves Jun 22 '11 at 00:22
  • Are you allowed to precompute a data structure based on S? If so, you can use perfect hashing to achieve O(N) time and O(N) space, which is not too shabby. Yeah, I know you want O(log N) space... Can we assume the integers themselves are bounded in any way? – Nemo Jun 22 '11 at 00:31
  • @Nemo, I know we can do well if space isn't an issue, so I'm really after keeping space to a **O(log N)** or better. If you have a solution in mind with **S** bounded, then I'd love to read it! – PengOne Jun 22 '11 at 00:37
  • You say we are given a set **S**. Can we modify its contents, while still providing an answer relative to the original **S**? – IVlad Jun 22 '11 at 00:55
  • 4
    Wait, did you say O(N^2) time and O(log N) space? Because I am pretty sure your average fifth grader could do this in O(N^2) time and constant space... :-) – Nemo Jun 22 '11 at 01:48
  • 1
    Are we *told* S, or do we just know that the elements in A are taken from some set S? Equivalently, do we just know that there are at most N distinct elements in A, or do we know what they are? If we know what they are, how is that knowledge provided? (E.g. if S is presented as a sorted list, then an entry's "numeric index" 0 <= i < N can be looked up in log N time.) – j_random_hacker Jun 22 '11 at 06:11
  • @Nemo : Those are just the upper bounds on the problem. We can use at most O(log n) space. This nullifies the O(N) space & time algorithm. – Priyank Bhatnagar Jun 22 '11 at 15:13
  • @logic_max: But he specifically said "I have a solution in mind that is O(log N) in space and O(N^2) in time." And I am sure it is a very clever solution, but... – Nemo Jun 22 '11 at 15:38
  • @Nemo : yes, i am curious where that log(n) space is used in the o(n^2) algorithm. – Priyank Bhatnagar Jun 22 '11 at 16:38
  • @logic_max, @Nemo: You're both correct. I did not need O(log N) space. I kept the constraint in there to leave room to improve the time without resorting to copying the array. I'll edit the question (yet again). Thanks. – PengOne Jun 22 '11 at 16:54

4 Answers4

7

This algorithm is similar to Justin Simon's, but the key point is how to compute the median (or the kth element) of S using just O(1) space efficiently.

Here is that key algorithm, which is randomized:

Set lower equal to the minimum element of S and upper equal to the maximum element of S. Pick a random element x from S that is between lower and upper (this costs at most O(n) expected time). Compute the rank of x (O(n) time). If x's rank is too low, set lower to the successor of x (O(n) time), else set upper equal to the predecessor of x (O(n) time). Repeat until lower equals upper.

Note that each iteration costs O(n) in expectation and there are O(lg n) iterations in expectation so the expected time cost is O(n lg n) and space usage is O(1) since we only store lower and upper.

Using this ability to select the kth element, we can then use the pigeonhole principle as suggested in the original question to find increasingly small segments of S that contain too many elements to all be distinct, using O(lg n) linear scans of A and O(1) space to store the relevant sums of elements in each region. Each such iteration costs O(n) in addition to the O(n lg n) cost of finding the kth element, and there are O(lg n) iterations, so the total cost is O(n lg^2 n).

Community
  • 1
  • 1
jonderry
  • 23,013
  • 32
  • 104
  • 171
  • I'm not understanding some of this. What do you mean by "the rank of x" and "the successor of x"? It reads like you've sorted **S**, which takes O(N) space. – PengOne Jun 22 '11 at 17:57
  • The successor of x is the minimum element (or "a" minimum element) greater than x. This can be found by iterating over S, storing just the smallest element greater than x from among S[1..i]. – jonderry Jun 22 '11 at 18:07
  • BTW, I feel like there must be a problem with this algorithm, since I can't find reference to it online. There are more complicated algorithms for finding the median though that don't quite achieve the same bounds. They would still satisfy the bounds you are looking for. See [this paper](http://www.cs.ucsb.edu/~suri/cs290/MunroPat.pdf). – jonderry Jun 22 '11 at 18:35
  • @jonderry: I'm suspicious of how you're finding the median, but I like the idea very much. I'll give it some thought this evening and see what I come up with. So far, this answer has the most promise. Thanks for thinking about this question! – PengOne Jun 22 '11 at 18:39
  • @PengOne, I think I may have found a reference to the median algorithm in my answer in [this journal paper](http://www.sciencedirect.com/science/article/pii/0304397595002251). I don't have access to the paper but could see some relevant text in a Google search result snippet (try Googling "the kth smallest element from a readonly array"). – jonderry Jun 22 '11 at 19:09
  • @jonderry: I have access. They use **O(log N)** space and the **O(N^(1+epsilon))** time to find the median without modifying the source, where epsilon (as the name suggests) is small (less than 1). This answer definitely deserves an up vote, even if the details don't pan out (which I'm now thinking they may!). – PengOne Jun 22 '11 at 19:10
0

The author of Finding Duplicate Elements in an Array suggests that even if one were to allocate an array of bits to represent every possible integer (a quite-manageable 2^24 byte bit-array gives one bit for every 32-bit integer) would still be defined as using O(1) space, and I tend to agree.

Therefore the simplest possible algorithm which tests and sets the bit representing each integer found in the array, returning the duplicate integer if the bit is already set, would run in O(n) time and use O(1) space.

Greg A. Woods
  • 2,663
  • 29
  • 26
0

Find the midpoint of the set S of N integers (if they're consecutive this is trivial, otherwise it can be done in O(logn)).

Go through your list A, calculate the number of entries which are less than that midpoint. So you then have either more entries in A less than your midpoint than there are distinct numbers in S which do the same, or you have less entries in A less than your midpoint, etc. In the former case take the entries less than the midpoint and repeat, in the latter take those greater than or equal to it.

This solution works in n(log(n))^2 time, I believe.

j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
Justin Simon
  • 1,133
  • 8
  • 8
  • @Justin: This is essentially the solution proposed by Daniel to his [original question](http://stackoverflow.com/questions/6420467/find-any-one-of-multiple-possible-repeated-integers-in-a-list). Storing the values to find the midpoint takes **O(N)** space. – PengOne Jun 22 '11 at 01:02
  • How do you find the median in logarithmic time? I only know the **O(n)** selection algorithms, and even those modify **S** while running. – IVlad Jun 22 '11 at 01:03
  • You're 100% right! Misremembered something from algorithms. However if S were sorted you'd be able to do it in constant time – Justin Simon Jun 22 '11 at 01:07
  • This algorithm works and runs in O(n lg^2 n) time and O(1) space. It's easy to select the ith element of S in O(n lg n) time and O(1) space. – jonderry Jun 22 '11 at 05:31
  • He said an array of n+1. That situation could never happen because n=2 – Justin Simon Jun 22 '11 at 10:35
  • In my example, n=3 and the array is of length n+1=4. – j_random_hacker Jun 22 '11 at 12:30
  • 1
    @j_random_hacker: I don't understand your objection. If I know that an interval [A,B) contains p distinct values in S and >=p+1 values in A, and C is the midpoint of [A,B), then either [A,C) or [C,B) contains q distinct values in S and >=q+1 values in A. In your example, if say S={2,3,42}, then the entry condition is that [2,\infty) has 3 values in S and 4 values in A, and taking C=3 I get [2,3) that has 1 value is S and 2 in A. This holds if the "midpoint" is anything in (A,B). – xofon Jun 22 '11 at 13:08
  • @xofon: You're absolutely right, I misunderstood the algorithm to be talking about comparing the sizes of each "half" in A to *each other*, rather than to the corresponding "half" of S. Ugh, not thinking clearly at all today :/ – j_random_hacker Jun 22 '11 at 14:26
  • @PengOne, my mistake. You can do it in a small polynomial amount of space though, less than O(n). The basic idea is to take a small random sample, and find an upper bound and a lower bound that are close to the median. Then look at all numbers between those two. The running time is O(n), not O(n lg n), though. – jonderry Jun 22 '11 at 17:12
  • @PengOne, check out randomized selection in Motwani's Randomized Algorithms text. – jonderry Jun 22 '11 at 17:21
  • @PengOne, I just posted an answer that redeems my earlier assertion that O(n lg n) time and O(1) space is possible. – jonderry Jun 22 '11 at 17:41
-1

If we can change the array, I think we can do it using inplace bucket sort in O(n) time and O(1) extra space.

Specifically, go through every element in the the list. For each element, check whether that number is equal to the index. If not, replace that number with the element in the index until the index and number are the same. If you see a same number in the new index, that's the duplicate.

Sheng
  • 29
  • 3
  • This does not provide an answer to the question. Once you have sufficient [reputation](http://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](http://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](http://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/low-quality-posts/14196656) – Stefan Svrkota Nov 07 '16 at 22:57