
So I found this purported interview question(1), which looks something like this:

Given an array of length n of integers with unknown range, find in O(n) time and O(1) extra space whether or not it contains any duplicate terms.

No additional conditions or restrictions are given. Assume that you can modify the original array. If it helps, you can restrict the datatype of the integers to ints (the original wording was a bit ambiguous) - although try not to use a variable with 2^32 bits to represent a hash map.

I know there is a solution for a similar problem, where the maximum integer in the array is restricted to n-1. I am aware that problems like

  1. Count frequencies of all elements in array in O(1) extra space and O(n) time
  2. Find the maximum repeating number in O(n) time and O(1) extra space
  3. Algorithm to determine if array contains n…n+m?

exist and either have solutions or answers saying that they are impossible. However, problems 1 and 2 are stronger than this one, and for 3 I'm fairly sure the solution offered there would need the additional n-1 constraint to be adapted for the task here.

So is there any solution to this, or is the problem unsolvable? If it is unsolvable, is there a proof that it cannot be solved in O(n) time and O(1) extra space?

(1) I say purported - I can't confirm whether or not it is an actual interview question, so I can't confirm that anyone thought it was solvable in the first place.

DarthCadeus

2 Answers


We can sort integer arrays in O(N) time! Therefore, sort the array and run the well-known scan for equal adjacent elements.

    #include <stdbool.h>
    #include <stddef.h>

    /* Assumed helpers: an in-place O(32N) radix sort and an in-place heapsort. */
    void radix_sort(int array[], size_t n);
    void heapsort(int array[], size_t n);

    bool contains_duplicate(int array[], size_t n)
    {
        if (n > 0x100000000ULL)
            return true; /* pigeonhole: more elements than distinct 32-bit ints exist */
        else if (n > 0x7FFFFFFF)
            radix_sort(array, n); /* yup, O(32N) = O(N) sort */
        else
            heapsort(array, n); /* here log2(N) <= 31 < 32, so heapsort's
                                   O(N log N) beats radix sort's O(32N) */
        for (size_t i = 1; i < n; i++)
            if (array[i] == array[i - 1])
                return true;
        return false;
    }
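The in-place binary radix sort assumed above could be sketched as follows: MSB-first, with one quicksort-style partition pass per bit, recursing on both halves (the names `sort_by_bit` and `radix_sort` are illustrative, not a standard API). Recursion depth is bounded by the 32-bit width, so stack use is O(1) in n, and total work is O(32n) = O(n). Bits are compared as unsigned, so negative ints land after positive ones; equal values still end up adjacent, which is all the duplicate scan needs.

```c
#include <stddef.h>

/* Partition [lo, hi) by the given bit, then recurse on both halves
 * with the next lower bit. */
static void sort_by_bit(int *a, size_t lo, size_t hi, int bit)
{
    if (bit < 0 || hi - lo < 2)
        return;
    size_t i = lo, j = hi;  /* invariant: bit clear in [lo,i), set in [j,hi) */
    while (i < j) {
        if (((unsigned)a[i] >> bit) & 1u) {
            int tmp = a[i]; a[i] = a[--j]; a[j] = tmp;
        } else {
            i++;
        }
    }
    sort_by_bit(a, lo, i, bit - 1);  /* bit-clear half */
    sort_by_bit(a, i, hi, bit - 1);  /* bit-set half */
}

void radix_sort(int array[], size_t n)
{
    sort_by_bit(array, 0, n, 31);  /* assumes 32-bit int */
}
```

This is only O(1) extra space in n; the fixed 32-frame recursion could also be rewritten as a loop if even that bothers you.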
Joshua
  • it seems radix sort has `O(n+b)` space complexity, which does not satisfy the OP's `O(1)` requirement – badger Aug 09 '21 at 19:14
    @hatefAlipoor: b can be constant 2. You can use the original array for scratch space. I did it once long ago. – Joshua Aug 09 '21 at 19:14
  • Thank you. I hope you could add more details on how to do that (using the original array for scratch space) - I have been fighting with this question for an hour :) – badger Aug 09 '21 at 19:22
  • @hatefAlipoor: The radix sort you do not struggle with for hours is not the true radix sort. I have seen the complete solution in the mind's eye, but that doesn't mean I can type it out bug-free in one go. I expect someone to come along and say it's cheating because if you changed the number of bits in an int you'd change the number of passes over the array. C'est la vie. – Joshua Aug 09 '21 at 19:34
  • To do a binary radix sort in place, you sort one bit at a time, so to sort k bit numbers, you'll need k sorts, starting from the msb to the lsb of the values. Each sort requires 1 or 2 passes over the array. If k is a constant (such as 32), then this is O(n). – Chris Dodd Aug 09 '21 at 20:11
  • @ChrisDodd: Correct, which falls below N log(N) when the array approaches 2^31 (about half the possible space). Since duplicates are the point of the question we ought not to assume that the array size is clamped at 2^32. – Joshua Aug 09 '21 at 20:13
  • In the question, `n` refers to the number of integers, not the size of the input, and it explicitly says that the range is unknown... so I think assuming that log(N) is O(1) is invalid, even though the OP said you could :) – Matt Timmermans Aug 09 '21 at 20:40
  • @MattTimmermans: I am using N as the number of integers in the array; and I am not assuming O(log(N)) is O(1) anywhere. The algorithm saturates to linear at the largest range; if it were "count the duplicates" and I couldn't pigeonhole it off at the top end it would remain linear for N > 2^32. – Joshua Aug 09 '21 at 20:55

You can do this in expected linear time by using the original array like a hash table...

Iterate through the array, and for each item, let item, index be the item and its index, and let hash(item) be a value in [0,n). Then:

  1. If hash(item) == index, then just leave the item there and move on. Otherwise,
  2. If item == array[hash(item)] then you've found a duplicate and you're all done. Otherwise,
  3. If item < array[hash(item)] or hash(array[hash(item)]) != hash(item), then swap those and repeat with the new item at array[index]. Otherwise,
  4. Leave the item and move on.

Now you can discard all the array elements where hash(item) == index. These are guaranteed to be the smallest items that hash to their target indexes, and they are guaranteed not to be duplicates.

Move all the remaining items to the front of the array and repeat with the new, smaller, subarray.

Each step takes O(N) time and on average removes some significant proportion of the remaining elements, leading to O(N) time overall. We can speed things up by taking advantage of all the free slots we're creating in the array, but that doesn't improve the overall complexity.
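As a sketch, the steps above might look like this in C. The multiplicative `slot()` hash is a hypothetical stand-in for `hash()`; the expected-linear bound depends on the hash spreading the actual input well, and a fresh table size each round re-randomizes the slots.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hash into [0, n): a simple multiplicative hash,
 * purely for illustration. */
static size_t slot(int x, size_t n)
{
    return (size_t)((uint32_t)x * 2654435761u % n);
}

/* Use the array itself as a hash table. Each pass parks, in each slot,
 * the smallest item hashing there; per the argument above, parked items
 * are distinct and can be discarded, and the survivors are compacted
 * and re-run with a smaller table. */
static bool has_duplicate(int *a, size_t n)
{
    while (n > 1) {
        for (size_t i = 0; i < n; i++) {
            for (;;) {
                int item = a[i];
                size_t t = slot(item, n);
                if (t == i)
                    break;                     /* step 1: already parked */
                if (item == a[t])
                    return true;               /* step 2: duplicate found */
                if (item < a[t] || slot(a[t], n) != t) {
                    a[i] = a[t];               /* step 3: swap and retry */
                    a[t] = item;
                    continue;
                }
                break;                         /* step 4: leave it */
            }
        }
        /* Discard parked items; keep the rest for the next round.
         * At least one item parks per pass, so n strictly shrinks. */
        size_t m = 0;
        for (size_t i = 0; i < n; i++)
            if (slot(a[i], n) != i)
                a[m++] = a[i];
        n = m;
    }
    return false;
}
```

Note the inner loop terminates: each swap either fills a previously mis-filled slot or strictly decreases the value parked at `t`, and after a decrease the displaced larger item is left in place on the very next check.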

Matt Timmermans