12

Given an array of n+1 integers, each in the range 1 to n, find an integer that is repeated.

I was asked this at a job interview. Here's my answer: The Pigeonhole Principle says there has to be a repeat. I tried to use a binary search approach, so I did this in Matlab, because that's what I know:

top = 0;
bot = 0;
for i=1:n+1
  if P[i] > n/2 
    top = top+1;
  else
    bot = bot+1;
end

So then I argue that one of these, top or bot, has to be bigger than n/2 by the PhP again. Take that range and repeat.

I thought this was a pretty good solution, but the interviewer sort of hinted that one can do better. Please post any better solutions you know of.

PengOne
  • 48,188
  • 17
  • 130
  • 149
Daniel
  • 944
  • 1
  • 7
  • 24
  • 2
    But there can be lots of repeats, so I don't think those questions are the same as mine. Also, sorting is really slow. A lot slower than my answer. – Daniel Jun 21 '11 at 04:56
  • @Matt Ball and other closers, I think we were hasty. The two dupes aren't really the same. – PengOne Jun 21 '11 at 05:04
  • I agree; not the same question. Deserves to be re-opened. – Nemo Jun 21 '11 at 05:05
  • @Daniel: By lots of repeats do you mean one integer repeated more than once or many integers repeated (possibly more than once)? – abcd Jun 21 '11 at 05:13
  • Both. You can have any numbers at all between 1 and n as many times. That's why it's really hard. – Daniel Jun 21 '11 at 05:15
  • @Daniel: It's not true that the sorting solution is slower than your answer. (It's slower than PengOne's answer below, though.) The sorting solution runs in O(n log n) time, while yours can be anywhere from nlogn to n^2 in the worst case depending on how exactly you implement "take that range and repeat". If you're already making one pass through the array, it's suboptimal to take only one bit of information out of it (like which range is greater) and then do the whole thing all over again. A "binary search" that doesn't run in O(log n) time shouldn't really be called binary search. – ShreevatsaR Jun 21 '11 at 05:44
  • Related, possibly duplicates: • [Why do I have to sum to find the repeating number](http://stackoverflow.com/questions/4414388/why-do-i-have-to-sum-to-find-the-repeating-number) • [Algorithm to find two repeated numbers in an array without sorting](http://stackoverflow.com/questions/555744/algorithm-to-find-two-repeated-numbers-in-an-array-without-sorting) – jscs Jun 21 '11 at 05:55
  • I am missing some constraints here. I don't see the problem. Speed? Memory? Find `any` (headline) or find `an` (first sentence)? – user unknown Jun 21 '11 at 23:29
  • Deleted my incorrect claim that your algorithm was incorrect :/ – j_random_hacker Jun 22 '11 at 14:30
  • @ShreevatsaR: I think this is O(n log n) time: it's binary subdivision where the test step happens to be an O(n) pass through all elements that accumulates 2 totals. Each step breaks an interval into 2 roughly-half-sized intervals, at least one of which must have "too many" elements appearing in the array. Although as the search proceeds and the interval we consider gets narrower, more and more of the elements visited in each pass won't contribute to either total because they're outside both candidate subranges, that's not too serious as there will be only log(n) passes anyway. – j_random_hacker Jun 22 '11 at 14:42
  • @j_random_hacker: Yes, if done right, this will be O(n log n) time. I wasn't very sure what happens after the step mentioned in the question, so I said "anywhere from nlogn to n^2". In any case, the simple solution of sorting the array is O(n log n) as well, so this is not faster. – ShreevatsaR Jun 22 '11 at 16:15

6 Answers6

18

I'm not sure how you're defining "better", but perhaps this qualifies. At least it's different from your solution and the solutions to the linked list questions (pun intended).

If we make a path

n+1 --> array[n+1] --> array[array[n+1]] --> ...

then this path contains a cycle if and only if array^k[n+1] = array^l[n+1] for some k != l, that is, if and only if there is a repeat. The question now becomes a common linked list problem that can be solved as follows.

Start two particles on the first node. Let the first particle move at unit speed and let the second particle move at twice unit speed. Then if there is a cycle, the second particle will end up looping back behind the first, and eventually they'll be the same. Why? Well, if you think of the particles as on a circle (which they will be once the find the loop), at every time unit the second particle gets one directed step closer to the first. Therefore they must eventually collide. One they do, you've found a loop. To find the repeated value, simply get the length of the loop by letting one particle stand still while the other runs the loop again. Then start both particles at the start again, let one move the length of the loop ahead, and then run both particles together with constant distance between them until they meet again at the beginning of the loop.

As some commentators are outraged that I did not include all of the details of how to find a loop in a linked list, here it now is. No promises that this isn't buggy (it's Matlab-esque pseudocode after all), but it should at least explain the idea.

%% STEP 1: find a point in the cycle of the linked list using a slow and fast particle
slow = n+1;
fast = n+1;
for i=1 to n+1
    slow = array[slow];
    fast = array[array[fast]];
    if (slow == fast)
        break;
end

%% STEP 2: find the length of the cycle by holding one particle fixed
length = 1;
fast = array[fast]
while fast != slow
    fast = array[fast];
    length = length+1;
end

%% STEP 3: find the repeated element by maintaining constant distance between particles
slow = n+1;
fast = n+1;
for i=1 to length
    fast = array[fast];
end
while fast != slow
    fast = array[fast];
    slow = array[slow];
end

%% STEP 4: return the repeated entry
return slow;

Here I started at n+1 because array[i] is between 1 and n, so neither particle will ever be sent back to n+1. This makes at most one pass through the array (though not in order) and keeps track of two particles (slow and fast) and one integer (length). The space is therefore O(1) and the time is O(n).

PengOne
  • 48,188
  • 17
  • 130
  • 149
  • What if fast jumps over slow? This might go on forever. – Daniel Jun 21 '11 at 05:09
  • Linear time, constant space. Any other way to do that? Asked like this, it's actually a good interview question. – Nemo Jun 21 '11 at 05:10
  • 2
    @Daniel: No, it cannot. This is the standard "pointer chasing" technique to detect cycles in a list. slow moves +1 each step, fast moves +2, so fast "gains" on slow by +1 each step. It cannot skip over it. – Nemo Jun 21 '11 at 05:10
  • 2
    @Daniel: A good question, but think about what happens when the fast particle is coming up behind the slow one. If the fast is one step behind, then they collide on the next step. If the fast one is two steps behind, then they collide in two steps. Make sense. – PengOne Jun 21 '11 at 05:11
  • Ok. I think I get it. Thank you. – Daniel Jun 21 '11 at 05:16
  • 1
    This is linear time and constant space, and can be proved optimal. You need at least linear time (you have to look at every array entry), and you need at least constant space. – ShreevatsaR Jun 21 '11 at 05:39
  • Yes, I think you're right. I like this one. Is there a good place to read more about "pointer chasing"? – Daniel Jun 21 '11 at 06:04
  • 1
    @Daniel: You can look for "Floyd's cycle-finding algorithm". [Wikipedia has a little](http://en.wikipedia.org/wiki/Cycle_detection#Tortoise_and_hare); but Googling for related terms can find more sources. Also, [Pollard's rho algorithm](http://en.wikipedia.org/wiki/Pollard's_rho_algorithm) is based on this; expositions of it may also help. – ShreevatsaR Jun 21 '11 at 06:09
  • Yes, I see and understand. `Any` isn't `all`. Since my comment isn't useful for anybody else, I will remove it soon. Just few minutes for you to recognise, so that you can remove your answers too. Thanks – user unknown Jun 22 '11 at 00:30
  • 1
    @fiver I have no clue what you're trying to communicate, but I'd like to point out that with the input n=7, array=[2,3,4,5,6,7,3,1] this program says that the repeated element is 3. And in fact it gives accurate output for all the data I've tested it on. Did you actually test it? – hobbs Jun 22 '11 at 00:31
  • @hobbs: this is because the author corrected it - now it works, but his original solution didn't work, because he was not finding the beginning of the cycle :) I'd at least expect a comment from him, but oh well. – Petar Ivanov Jun 22 '11 at 03:13
  • I am sorry I pissed you off - I think I already appologized for it. I was just surprised that so many people with high popularity didn't see that the solution was incorrect. It was nothing against you. But what pisses me off is that you don't admit your answer was not correct - you deleted all your comments (so I had to do the same, because mine were replies to yours) and corrected your asnwer, as if nothing has happened. Anyway, I don't care, as soon as the answer is correct, which now it is, I'm fine. Good luck. – Petar Ivanov Jun 22 '11 at 04:27
  • 3
    @fiver: The solution wasn't incorrect at any time, just incomplete. It illustrated the core idea (cycle detection), which was sufficient for anyone who thinks hard enough to complete the algorithm. But yes, a complete answer is better, which it now is. Perhaps thanks to your comments. :-) – ShreevatsaR Jun 22 '11 at 16:24
  • 2
    Ah, the critical part is starting at element n+1, since that can't be part of any "ordinary" duplicate-free cycle... Very clever! – j_random_hacker Jun 28 '11 at 01:57
  • Wonderful! Here's an implementation in python, with tests: https://gist.github.com/theicfire/37609e65230db47fae9e – theicfire Dec 24 '14 at 05:11
  • I posted a simpler answer. Finding the length of cycle is not needed (step 2, step 3 in this answer). My answer relies on the fact that the `meet node`, where `slow` and `fast` meet, and `start node` both are same hops away from `cycle node`. So two pointers starting at `meet node` and `start node` will meet at `cycle node` when advanced at 1x speed. The `cycle node` is what the repeating elements point to. – hIpPy Jul 29 '16 at 19:34
3

If you know that there is exactly one number that is duplicate you can find it by summing all of them and subtracting the sum of numbers from 1 to n:

duplicate = sum P[i] - n(n+1)/2

If not, then you can iterate through the array and put each number in a hashtable. If the number already exists then that's the duplicate. This is also O(n) assuming the hashtable operations are O(1).

Or event better - to avoid the hashtable you can use an array of booleans of size n:

int[] P = new int[] { 3, 2, 5, 1, 4, 2 };
bool[] Q = new bool[6];

foreach( var p in P ){
    if ( Q[p] ) {
        Console.WriteLine("Duplicate: " + p);
        break;
    }
    Q[p] = true;
}
Petar Ivanov
  • 91,536
  • 11
  • 82
  • 95
  • This second one is slower than the binary solution I think. – Daniel Jun 21 '11 at 05:08
  • 1
    Not really. This iterates over the array only once, while the binary solution iterates over some parts of the array more than once. – Petar Ivanov Jun 21 '11 at 05:21
  • But you have to save too much information. I want to do better in space and time. – Daniel Jun 21 '11 at 05:23
  • 3
    True. But when you say "slower" you mean time. So it's not slower! It's faster, but takes more space. – Petar Ivanov Jun 21 '11 at 05:27
  • This is linear time and linear space, which *is* optimal in time. It is definitely faster than the so-called binary-search solution in the question. It is also optimal, time-wise, though the space can be improved as in PengOne's answer. – ShreevatsaR Jun 21 '11 at 05:47
  • This one takes more space, but I see that it's better than my answer. THanks. – Daniel Jun 21 '11 at 06:06
  • @ShreevatsaR: the space can only be optimized if we do not mind destructing the input. If we want to preserve it, then using an array of booleans is cheaper that copying the array itself. – Matthieu M. Jun 21 '11 at 13:36
  • @Matthieu: PengOne's answer does not destroy the input. It does not modify any element of `array`. – ShreevatsaR Jun 21 '11 at 13:41
  • @ShreevatsaR: right! I read too fast and thought he was creating the list in place. Awesome answer then (though factoring in the effect of caching could make it slowish because of the random accesses). – Matthieu M. Jun 21 '11 at 14:00
0

This works in a similar way as @PengOne's answer but it is simpler I believe.

Explanation:

This approach treats the array as a graph where value at index i points to index a[i]-1 (so value 1 points to index 0). There is at least 1 repeating number, so the graph will be cyclic. There are n+1 elements and the max is n, so the last node a[n+1] will never be a part of a cycle but will enter a cycle. This is important as this last node is the start node for traversal. Note that if a node which is part of a cycle is used as start node with slow (1x) and fast (2x) pointers then they meet at that same node itself which is not useful. Let's call the converging node as meet node. If the meet node is k hops away from the cycle node, the start node will also be k hops away from the cycle node. This logic is same as finding the cycle node in a cyclic linked list. The array is traversed a max of 3 times so O(n) time and O(1) space.

enter image description here

Algo:

  1. Starting from last node (a[n+1]), find the meet node using slow (1x) and fast (2x) pointers.
  2. Advance two pointers (1x) from meet node and from start node and they will converge at the cycle node (The repeating numbers point to cycle node).

Code:

//pseudocode
//O(n) time, O(1) space
findrepeating(a):
    x = findrepeating(a, findmeet(a), a[a.length() -1])
    return x

findmeet(a):
    slow = fast = a[a.length() -1]
    while true:
        slow = a[slow-1]
        fast = a[a[fast-1]-1]
        if slow == fast:
            break
    meet = slow // or fast
    return meet

findrepeating(a, meet, start):
    m = meet
    s = start
    while m != s:
        m = a[m-1]
        s = a[s-1]
    return m // or s
hIpPy
  • 4,649
  • 6
  • 51
  • 65
0

We use circle detection's idea to solve this problem.

All we need to do is first find the begin of the circle and then find the duplicated one in the circle.

Here's the code in c++:

int findDuplicate(vector<int>& nums) {
    int slow = nums[0];
    int fast = nums[nums[0]];

    while(slow != fast){
        slow = nums[slow];
        fast = nums[nums[fast]];
    }
    fast = 0;
    while(slow != fast){
        slow = nums[slow];
        fast = nums[fast];
    }

    return slow;
}
Boooooooooms
  • 306
  • 4
  • 21
0

How about this simple solution:

start creating a binary search tree from the array. Whenever there is an element already present which is a duplicate while you are inserting into the BST then store that element in another array of duplicate elements and continue your loop .we don't even need to sort the array for finding the duplicates here.

This is just my idea.I was asked the same question in one of the interviews and this was my answer.

Vijay
  • 65,327
  • 90
  • 227
  • 319
-1
for(int i=n+1;i!=a[i];i=a[i]);

cout<<i;
ans
  • 3
  • 2