
(This is a generalization of: Finding duplicates in O(n) time and O(1) space)

Problem: Write a C++ or C function with time and space complexities of O(n) and O(1) respectively that finds the repeating integers in a given array without altering it.

Example: Given {1, 0, -2, 4, 4, 1, 3, 1, -2}, the function must print 1, -2, and 4 once each (in any order).


EDIT: The following solution requires a duo-bit (to represent 0, 1, and 2) for each integer in the range of the minimum to the maximum of the array. The number of necessary bytes (regardless of array size) never exceeds (INT_MAX - INT_MIN)/4 + 1.
#include <stdio.h>
#include <stdlib.h>

void set_min_max(int a[], long long unsigned size,
                 int* min_addr, int* max_addr)
{
    long long unsigned i;

    if(!size) return;
    *min_addr = *max_addr = a[0];
    for(i = 1; i < size; ++i)
    {
        if(a[i] < *min_addr) *min_addr = a[i];
        if(a[i] > *max_addr) *max_addr = a[i];
    }
}

void print_repeats(int a[], long long unsigned size)
{
    long long unsigned i;
    int min, max;
    long long diff, q, r;
    unsigned char* duos;

    if(!size) return;
    set_min_max(a, size, &min, &max);
    diff = (long long)max - (long long)min;
    duos = calloc(diff / 4 + 1, 1); /* one byte holds four duo-bits */
    if(!duos) return;
    for(i = 0; i < size; ++i)
    {
        diff = (long long)a[i] - (long long)min; /* index of duo-bit
                                                    corresponding to a[i]
                                                    in sequence of duo-bits */
        q = diff / 4; /* index of byte containing duo-bit in "duos" */
        r = diff % 4; /* offset of duo-bit */
        switch( (duos[q] >> (6 - 2*r )) & 3 )
        {
            case 0: duos[q] += (1 << (6 - 2*r));
                    break;
            case 1: duos[q] += (1 << (6 - 2*r));
                    printf("%d ", a[i]);
        }
    }
    putchar('\n');
    free(duos);
}

int main(void)
{
    int a[] = {1, 0, -2, 4, 4, 1, 3, 1, -2};
    print_repeats(a, sizeof(a)/sizeof(int));
    return 0;
}
Apshir
  • Does your solution work for large input arrays when your multiplication overflows? – parapura rajkumar Nov 21 '11 at 07:13
  • Pubby: That question has only numbers 0..n-1 in array of size n. – Apshir Nov 21 '11 at 07:14
  • Strictly speaking your solution is neither O(n) time nor O(1) space. It overflows for `{5,5,5,5,5,5,5}`. To solve this you will need to use arbitrary-precision arithmetic, which has non-O(1) time and space complexity. – Yakov Galka Nov 21 '11 at 07:17
  • I agree with @ybungalobill, your proposed solution does not meet the constraints. For example, to work for any input you need to be able to calculate O(n) prime numbers, which either requires O(n) space for a table or O(n^2) time if they're calculated on the fly. – caf Nov 21 '11 at 07:25
  • @ybungalobill: It overflows because p rapidly grows. In theory it works fine though; check {5,5,5}. Why is it not O(n)? Why not O(1)? – Apshir Nov 21 '11 at 07:29
  • caf: What about having them (primes) listed in the program? – Apshir Nov 21 '11 at 07:33
  • If you have support for arbitrarily large integers (in O(1)) the overflow problem does not exist. But then you need the list of prime numbers up to your largest input (*2) which you cannot know in advance. So you cannot hardcode the primes. The computation of the primes is neither O(n) in time nor O(1) in space. – undur_gongor Nov 21 '11 at 07:38
  • Even if you had an oracle that calculates the primes for you, up to `n` there are `O(logn)` prime numbers, so your program is not `O(1)` space I am afraid... – amit Nov 21 '11 at 07:41
  • @parapura rajkumar: Are you asking "Does your solution work..." or are you answering "when ... overflows?" – Apshir Nov 21 '11 at 07:56
  • @Afshin: In *theory* you need O(n) bits for the variable `p`. You can't magically compress O(n) data into O(1) space. – Yakov Galka Nov 21 '11 at 08:29
  • @ybungalobill: OK, then p overflows quickly. That's why this solution only works at small scale. If there are no other solutions, then we might conclude that, practically speaking, this problem doesn't have a solution. – Apshir Nov 21 '11 at 08:44
  • @ybungalobill: I had to add "without altering the array" after "doesn't have a solution." – Apshir Nov 21 '11 at 09:00
  • @ybungalobill: If your solution just allocates a big-ass array (512MB would do to give you a bit for every possible 32-bit number) then you've got constant space. It won't be _efficient_ but it would be a technically correct solution… – Donal Fellows Nov 21 '11 at 09:22
  • @DonalFellows your understanding of asymptotic analysis is wrong. If you restrict your algorithm to 512MB of data, you can't talk about complexity anymore, since it won't solve the problem when n → ∞. – Yakov Galka Nov 21 '11 at 09:35
  • @DonalFellows It's like finding the nth prime number in constant time: store a table of the first 10000 primes and return the nth cell... well, you didn't solve the problem since your *algorithm* doesn't work for e.g. n = 10001. – Yakov Galka Nov 21 '11 at 09:40
  • @ybungalobill: But that's banging your head against a brick wall in order to go through it when there is a gate 3 meters away. With two 512MB arrays, _you don't need primes to solve the problem for 32-bit integers_. You just have two flag bits (“seen at least once”, “seen at least twice”) per integer. Since the memory allocated isn't done with any respect for the size of the input data, it must be O(1) but with one heck of a constant factor. You also get trivial O(n) time behavior. – Donal Fellows Nov 21 '11 at 09:47
  • @ybungalobill: Donal's understanding of asymptotic analysis is correct. `n`, the size of the input array, is unbounded in his asymptotic analysis. It does not necessarily follow that there is no limit on the size of each individual value in the input, and since the question states that it's a C++ array of integers, and the only so-called "integer types" in C++ are fixed-width, that means the values are bounded. Of course, if the integer type chosen is `long long` then allocating storage for even 1 bit per value is infeasible. – Steve Jessop Nov 21 '11 at 09:59
  • That said, `size_t` is also a fixed-width type, making the whole analysis hypothetical. The C++ memory-model doesn't match the memory-model of a Turing machine, so asymptotic analyses are usually in inverted commas. – Steve Jessop Nov 21 '11 at 10:40
  • @Steve: Indeed. Asymptotic analysis is useful, but you get odd results from it sometimes. (That said, it's often possible to reduce the scope of data being handled so that you can use flag arrays with less memory commitment. Or you do the usual trick, don't worry quite so much about a strict O(1)/O(N) requirement, and use a modern datastructure while relying on data characteristics to save you from the worst case; that works ever so well with non-malicious input.) – Donal Fellows Nov 21 '11 at 13:39
  • @Steve Jessop The size of `size_t` is not upperbounded by standard, so it is perfectly meaningful from a theoretical perspective to consider the asymptotic behavior of a family of executables derived uniformly from a single C++ source file by compilers that provide `size_t` types of various sizes. – Per Nov 21 '11 at 15:24
  • I've linked some relevant complexity theory research here : http://stackoverflow.com/q/8263672/667457 – Jeff Burdges Nov 25 '11 at 21:22

7 Answers


The definition of big-O notation is that its argument is a function (f(x)) such that, as the variable x tends to infinity, there exists a constant K for which the cost function being measured is eventually smaller than Kf(x). Typically f is chosen to be the simplest such function for which the condition is satisfied. (It's pretty obvious how to lift the above to multiple variables.)

This matters because that K — which you aren't required to specify — allows a whole multitude of complex behavior to be hidden out of sight. For example, if the core of the algorithm is O(n^2), it allows all sorts of other O(1), O(log n), O(n), O(n log n), O(n^(3/2)), etc. supporting bits to be hidden, even if for realistic input data those parts are what actually dominate. That's right, it can be completely misleading! (Some of the fancier bignum algorithms have this property for real. Lying with mathematics is a wonderful thing.)

So where is this going? Well, you can assume that int is a fixed size easily enough (e.g., 32-bit) and use that information to skip a lot of trouble and allocate fixed-size arrays of flag bits to hold all the information that you really need. Indeed, by using two bits per potential value (one bit to say whether you've seen the value at all, another to say whether you've printed it) you can handle everything with a fixed chunk of memory 1GB in size. That will then give you enough flag information to cope with as many 32-bit integers as you might ever wish to handle. (Heck, that's even practical on 64-bit machines.) Yes, it's going to take some time to set that memory block up, but it's constant, so it's formally O(1) and drops out of the analysis. Given that, you then have constant (but whopping) memory consumption and linear time (you've got to look at each value to see whether it's new, seen once, etc.), which is exactly what was asked for.

It's a dirty trick though. You could also try scanning the input list to work out the range allowing less memory to be used in the normal case; again, that adds only linear time and you can strictly bound the memory required as above so that's constant. Yet more trickiness, but formally legal.


[EDIT] Sample C code (this is not C++, but I'm not good at C++; the main difference would be in how the flag arrays are allocated and managed):

#include <stdio.h>
#include <stdlib.h>

// Bit-fiddling helpers: one flag bit per possible 32-bit value
int is(unsigned *ary, unsigned value) {
    return (ary[value>>5] >> (value&31)) & 1u;
}
void set(unsigned *ary, unsigned value) {
    ary[value>>5] |= 1u << (value&31);
}

// Main loop
void print_repeats(int a[], unsigned size) {
    unsigned *seen, *done;
    unsigned i;

    // 134217728 32-bit words = 2^32 flag bits = 512MB per array
    seen = calloc(134217728, sizeof(unsigned));
    done = calloc(134217728, sizeof(unsigned));
    if (!seen || !done) { free(seen); free(done); return; }

    for (i=0; i<size; i++) {
        if (is(done, (unsigned) a[i]))
            continue;
        if (is(seen, (unsigned) a[i])) {
            set(done, (unsigned) a[i]);
            printf("%d ", a[i]);
        } else
            set(seen, (unsigned) a[i]);
    }

    printf("\n");
    free(done);
    free(seen);
}

int main(void) {
    int a[] = {1,0,-2,4,4,1,3,1,-2};
    print_repeats(a,sizeof(a)/sizeof(int));
    return 0;
}
Donal Fellows
  • This is all far easier than pre-calculating the first 4 billion primes… – Donal Fellows Nov 21 '11 at 16:02
  • I had to give your answer an up vote. Can you implement your idea in a C++ function now? That's what the problem asks for. – Apshir Nov 22 '11 at 07:19
  • The magic constant, `134217728`, should be instantly recognizable as 128M. (Thank goodness for `dc`…) – Donal Fellows Nov 22 '11 at 09:54
  • The code also _strongly_ assumes a 32-bit `int` and is only practical on a 64-bit machine. Luckily, many systems are now I32LP64 with a proper amount of memory, so there's a fair chance of getting it all to work. :-) – Donal Fellows Nov 22 '11 at 09:56
  • Compiled explicitly as a C program, though, it doesn't generate any compilation errors. – Apshir Nov 22 '11 at 17:39
  • Crashes with no output -- on a system with an i5 processor and 6GB RAM. – Apshir Nov 23 '11 at 07:16
  • @Afshin: Yeah, because I got the bit-mangling wrong. Shows what happens when I'm in a hurry and don't test… – Donal Fellows Nov 23 '11 at 09:26

Since you have an array of integers, you can use the straightforward solution of sorting the array (you didn't say it can't be modified) and then printing the duplicates. Integer arrays can be sorted in O(n) time with O(1) space using radix sort. Although radix sort in general might require O(n) space, the in-place binary MSD radix sort can be implemented using O(1) extra space (look here for more details).
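
As a rough illustration of that approach (a sketch, not the answer's own code, and assuming a 32-bit int), the following C functions sort the array in place with a binary MSD radix sort on the sign-adjusted bit patterns and then print each repeated value once; the recursion depth is bounded by the word size, so the extra space is O(1) in n, but the array is of course modified:

#include <stdio.h>

/* Partition a[lo..hi) on the bit selected by `bit` (after flipping the sign
   bit so that negative values order correctly), then recurse on each half.
   Recursion depth is at most 32, so extra space stays O(1) in n. */
static void msd_radix_sort(int a[], unsigned lo, unsigned hi, int bit)
{
    unsigned i = lo, j = hi;

    if (hi - lo < 2) return;
    while (i < j) {
        if ((((unsigned)a[i] ^ 0x80000000u) >> bit) & 1u) {
            int tmp = a[i]; a[i] = a[--j]; a[j] = tmp;  /* 1-bit: move to back */
        } else {
            ++i;                                        /* 0-bit: stays in front */
        }
    }
    if (bit > 0) {
        msd_radix_sort(a, lo, i, bit - 1);
        msd_radix_sort(a, i, hi, bit - 1);
    }
}

static void print_repeats_sorted(int a[], unsigned size)
{
    unsigned i;

    msd_radix_sort(a, 0, size, 31);
    for (i = 1; i < size; ++i)  /* print at the second element of each equal run */
        if (a[i] == a[i - 1] && (i < 2 || a[i] != a[i - 2]))
            printf("%d ", a[i]);
    putchar('\n');
}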

Konstantin Oznobihin
  • Radix sort is O(n) space complexity. – David Brown Nov 21 '11 at 08:51
  • Why is the time complexity O(n)? – a-z Nov 21 '11 at 09:05
  • @David Brown: radix sort can be implemented with O(1) space complexity. – salva Nov 21 '11 at 09:07
  • @a-z: With radix sort, time complexity is guaranteed to be O(n). – salva Nov 21 '11 at 09:08
  • @Konstantin Oznobihin: How can radix sort be implemented with O(n) time and O(1) space on integers? (you should explain more) – a-z Nov 21 '11 at 09:28
  • @DavidBrown: it is not for in-place binary radix sort http://en.wikipedia.org/wiki/Radix_sort#In-place_MSD_radix_sort_implementations. – Konstantin Oznobihin Nov 21 '11 at 09:30
  • The question has been edited now to forbid an in-place sort, so anyway it's no longer relevant. – Steve Jessop Nov 21 '11 at 09:50
  • @Konstantin Oznobihin You are still trading off time complexity and space complexity. In the algorithm referenced in the wikipedia article they use counting sort with space complexity O(bucketSize). You can fix the bucket size and then have fixed memory usage, but then your runtime is not O(n) for large enough inputs, and if you adjust the bin size based on the input you no longer have O(1) space complexity. – David Brown Nov 21 '11 at 10:48
  • @DavidBrown: counting sort is used only when radix is greater than 2. Read the first paragraph at http://en.wikipedia.org/wiki/Radix_sort#In-place_MSD_radix_sort_implementations carefully. Anyhow, it's no longer relevant with in-place sorting forbidden. If you like we can discuss this in chat. – Konstantin Oznobihin Nov 21 '11 at 13:33
  • Radix / Counting sort takes O(n log k) time for integers of size O(k). If your integers are bounded by a constant, then this becomes O(n), but if they are bounded by n it becomes O(n log n), and if they are unbounded, then you are in trouble! Hopefully in the problem statement "integer" means "int". – Imran Nov 29 '11 at 21:32

The O(1) space constraint is intractable.

The very fact of printing the array itself requires O(N) storage, by definition.

Now, feeling generous, I'll give you that you can have O(1) storage for a buffer within your program and consider that the space taken outside the program is of no concern to you, and thus that the output is not an issue...

Still, the O(1) space constraint feels intractable, because of the immutability constraint on the input array. It might not be, but it feels so.

And your solution overflows because you try to squeeze O(N) bits of information into a fixed-size datatype.

Matthieu M.

There is a tricky problem with definitions here. What does O(n) mean?

Konstantin's answer claims that the radix sort time complexity is O(n). In fact it is O(n log M), where the base of the logarithm is the radix chosen, and M is the range of values that the array elements can have. So, for instance, a binary radix sort of 32-bit integers will have log M = 32.

So this is still, in a sense, O(n), because log M is a constant independent of n. But if we allow this, then there is a much simpler solution: for each integer in the range (all 4294967296 of them), go through the array to see if it occurs more than once. This is also, in a sense, O(n), because 4294967296 is also a constant independent of n.

I don't think my simple solution would count as an answer. But if not, then we shouldn't allow the radix sort, either.
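
To make that concrete, here is a sketch (not TonyK's code) of that deliberately absurd "solution", assuming a 32-bit int: it uses O(1) extra space and is "O(n)" only because 4294967296 is a constant, even though it would take essentially forever to run:

#include <stdio.h>

/* For every possible 32-bit value, count how often it occurs in the array.
   O(1) extra space; "O(n)" time only because the outer loop runs a constant
   4294967296 times. */
void print_repeats_bruteforce(const int a[], unsigned size)
{
    long long v;

    for (v = -2147483648LL; v <= 2147483647LL; ++v) {
        unsigned i, count = 0;
        for (i = 0; i < size && count < 2; ++i)
            if (a[i] == (int)v)
                ++count;
        if (count == 2)
            printf("%lld ", v);
    }
    putchar('\n');
}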

TonyK

I really don't see how you can have only O(1) space and not modify the initial array. My guess is that you need an additional data structure. For example, what is the range of the integers? If it's 0..N-1 like in the other question you linked, you can have an additional count array of size N. Then traverse the original array in O(N) and increment the counter at the position of the current element. Then traverse the count array and print the numbers with count >= 2. Something like:

int* counts = new int[N]();  // value-initialized to zero
for(int i = 0; i < N; i++) {
    counts[input[i]]++;
}

for(int i = 0; i < N; i++) {
    if(counts[i] >= 2) cout << i << " ";
}

delete [] counts;
Tudor

I doubt this is possible. Assuming there is a solution, let's see how it works. I'll try to be as general as I can and show that it can't work... So, how does it work?

Without loss of generality, we could say we process the array k times, where k is fixed. The solution should also work when there are m duplicates, with m >> k. Thus, in at least one of the passes, we should be able to output x duplicates, where x grows as m grows. To do so, some useful information must have been computed in a previous pass and stored in the O(1) storage. (The array itself can't be used; that would give O(n) storage.)

The problem: we have only O(1) information, yet when we walk over the array we have to identify those x numbers (to output them). We need an O(1) store that can tell us, in O(1) time, whether an element is in it. Put differently, we need a data structure that stores n booleans (of which x are true), uses O(1) space, and takes O(1) time to query.

Does this data structure exist? If not, then we can't find all duplicates in an array with O(n) time and O(1) space (or there is some fancy algorithm that works in a completely different manner???).

Ishtar

Say you can exploit the fact that you are not using all the space you have: you only need one extra flag bit per possible value, and there are plenty of unused bits in your 32-bit int values.

This has serious limitations, but it works in this case. The numbers have to be between -n/2 and n/2, and if a value repeats m times it will be printed m/2 times.

void print_repeats(long a[], unsigned size) {
    long i, val, pos, mask = 0x7FFFFFFFL, topbit = ~mask;
    for (i = 0; i < size; i++)
        a[i] &= mask;

    for (i = 0; i < size; i++) {
        val = a[i] & mask;
        if (val <= mask/2) {
           pos = val;
        } else {
            val += topbit;
            pos = size + val;
        }
        if (a[pos] < 0) {
            printf("%d\n", val);
            a[pos] &= mask;
        } else {
            a[pos] |= topbit;
        }
    }
}

int main(void) {
    long a[] = {1, 0, -2, 4, 4, 1, 3, 1, -2};
    print_repeats(a, sizeof (a) / sizeof (long));
    return 0;
}

prints

4
1
-2
Peter Lawrey