algorithm removing duplicate elements in array without auxillay storage

Question

I am working on this famous interview question on removing duplicate elements in array without using auxillary storage and preserving the order;

I have read a bunch of posts; Algorithm: efficient way to remove duplicate integers from an array, Removing Duplicates from an Array using C.

They are either implemented in C (without explanation) or the Java Code provided just fails when there is consecutive duplicates such as [1,1,1,3,3].

I am not quite confident with using C, my background is Java. So I implemented the code myself; it follows like this:

use two loops, the outer-loop traverses the array and inner loop checks for duplicates and if present replace it with null.
Then I go over the duplicate-replaced-null array and remove null elements and replacing it with the next non-null element.

The total run-time I see now is O(n^2)+O(n) ~ O(n^2). Reading the above posts, I understood this is the best we can do, if no sorting and auxiliary storage is allowed. My code is here: I am looking for ways to optimize any further (if there is a possibility) or a better/simplisitc logic;

public class RemoveDup {
    public static void main (String[] args){
        Integer[]  arr2={3,45,1,2,3,3,3,3,2,1,45,2,10};
            Integer[] res= removeDup(arr2);
                System.out.println(Arrays.toString(res));
            }
          private static Integer[] removeDup(Integer[] data) {
            int size = data.length;
            int count = 1;
                for (int i = 0; i < size; i++) {
                    Integer temp = data[i];
                    for (int j = i + 1; j < size && temp != null; j++) {
                        if (data[j] == temp) {
                            data[j] = null;
                        }
                    }
                }
                for (int i = 1; i < size; i++) {
                    Integer current = data[i];
                    if (data[i] != null) {
                        data[count++] = current;
                    }
                }

                return Arrays.copyOf(data, count);

         }

}

EDIT 1; Reformatted code from @keshlam throws ArrayIndexOutofBound Exception:

private static int removeDupes(int[] array) {
        System.out.println("method called");
        if(array.length < 2)
          return array.length;

        int outsize=1; // first is always kept

     for (int consider = 1; consider < array.length; ++consider) {

          for(int compare=0;compare<outsize;++compare) {
            if(array[consider]!=array[compare])
                array[outsize++]=array[consider]; // already present; advance to next compare
           else break;
          // if we get here, we know it's new so append it to output
          //array[outsize++]=array[consider]; // could test first, not worth it. 

        }

      }
        System.out.println(Arrays.toString(array));
         // length is last written position plus 1
        return outsize;
    }

I think, that under that restrictions you cannot do better than O(N^2). So code looks ok. — AlexWien, Feb 06 '14 at 20:37
I don't think that `Arrays.copyOf()` is doing what you want it to do. — Nir Alfasi, Feb 06 '14 at 20:39
There's a solution I like better which requires only one pass through the outer loop, but I'm not sure it's any better on the O() scale. — keshlam, Feb 06 '14 at 20:40
@keshlam: I would love to hear that pass. could post an answer out of it. — brain storm, Feb 06 '14 at 20:42
I'd rather leave it as exercise for the reader, since knowing that it's possible you *should* be able to figure it out for yourself. Hint keywords: Incremental, and Split. — keshlam, Feb 06 '14 at 20:49
Your program works. I would get rid of the stuff about null and just move the unique numbers to the front of the array. Since you have to scan the list of unique numbers for each new number you test you will not be able to do better than O(N^2) — Marichyasana, Feb 06 '14 at 20:51
When you finish the first (two nested) for-loops the array has "holes" of nulls. `Arrays.copyOf()` takes as the second argument the "size", copies the entire array and pad it with nulls/zeros from "size" till the end. That's not what you want - what you want is to "defragment" the array. — Nir Alfasi, Feb 06 '14 at 20:52
@keshlam: I dont understand from your keywords, but I guess someone else can point me with elaborate details — brain storm, Feb 06 '14 at 20:56
@alfasin: I am not returning array with nulls. I am removing in the next pass (the for-loop below); with Arrays, it is not possible to add or remove elements from it.. — brain storm, Feb 06 '14 at 20:58
@keshlam: The one you were thinking of (from below comment) will have O(n^3)..? — brain storm, Feb 06 '14 at 21:14
@user1988876 You are already using an auxiliary data-structure since you're using `Arrays.copyOf`. If you know the range of the integers in the array - you can use a modified version of bucket-sort to remove duplicates. Not sure about preserving the order though. — Nir Alfasi, Feb 06 '14 at 21:19
@alfasin: in a way yes, I am using auxillary stucture, but that is simply because arrays are fixed size, no deletions or additions.. — brain storm, Feb 06 '14 at 21:36

keshlam · Accepted Answer · 2014-02-07T14:06:47.863

3

OK, here's my answer, which should be O(N*N) worst case. (With smaller constant, since even worstcase I'm testing N against -- on average -- 1/2 N, but this is computer science rather than software engineering and a mere 2X speedup isn't significant. Thanks to @Alexandru for pointing that out.)

1) Split cursor (input and output advanced separately),

2) Each new value only has to be compared to what's already been kept, and compare can stop if a match is found. (The hint keyword was "incremental")

3) First element need not be tested.

4) I'm taking advantage of labelled continue where I could have instead set a flag before breaking and then tested the flag. Comes out to the same thing; this is a bit more elegant.

4.5) I could have tested whether outsize==consider and not copied if that was true. But testing for it would take about as many cycles as doing the possibly-unnecessary copy, and the majority case is that they will not be the same, so it's easier to just let a possibly redundant copy take place.

5) I'm not recopying the data in the key function; I've factored out the copy-for-printing operation to a separate function to make clear that removeDupes does run entirely in the target array plus a few automatic variables on the stack. And I'm not spending time zeroing out the leftover elements at the end of the array; that may be wasted work (as in this case). Though I don't think it would actually change the formal complexity.

import java.util.Arrays;

public class RemoveDupes {

  private static int removeDupes(final int[] array) {
    if(array.length < 2)
      return array.length;

    int outsize=1; // first is always kept

    outerloop: for (int consider = 1; consider < array.length; ++consider) {

      for(int compare=0;compare<outsize;++compare)
        if(array[consider]==array[compare])
          continue outerloop; // already present; advance to next compare

      // if we get here, we know it's new so append it to output
      array[outsize++]=array[consider]; // could test first, not worth it. 
    }

    return outsize; // length is last written position plus 1
  }

  private static void printRemoveDupes(int[] array) {
    int newlength=removeDupes(array);
    System.out.println(Arrays.toString(Arrays.copyOfRange(array, 0, newlength)));
  }

  public static void main(final String[] args) {
    printRemoveDupes(new int[] { 3, 45, 1, 2, 3, 3, 3, 3, 2, 1, 45, 2, 10 });
    printRemoveDupes(new int[] { 2, 2, 3, 3 });
    printRemoveDupes(new int[] { 1, 1, 1, 1, 1, 1, 1, 1 });
  }
}

LATE ADDITION: Since folks expressed confusion about point 4 in my explanation, here's the loop rewritten without labelled continue:

for (int consider = 1; consider < array.length; ++consider) {
  boolean matchfound=false;

  for(int compare=0;compare<outsize;++compare) {
    if(array[consider]==array[compare]) {
      matchfound=true;
      break;
    }

    if(!matchFound) // only add it to the output if not found
      array[outsize++]=array[consider];
}

Hope that helps. Labelled continue is a rarely-used feature of Java, so it isn't too surprising that some folks haven't seen it before. It's useful, but it does make code harder to read; I probably wouldn't use it in anything much more complicated than this simple algorithm.

edited Feb 07 '14 at 14:06

answered Feb 06 '14 at 22:07

keshlam

7,931
2
19
33

1

Nop. Even if not every value is tested against every other value it can still be `O(n²)` but with a smaller constant. So you have, 1 test, then 2 test, the 3 and up to `N-1`: [Check](http://www.wolframalpha.com/input/?i=sum+i%2C+i%3D1..n-1) the result of this sum on wolframalpha. – Alexandru Barbarosie Feb 06 '14 at 22:51
1

.... Aaaaah. It averages out to N * N/2, which is indeed O(N*N) when the 1/2 is factored out. My bad, good catch, I'll fix the description. – keshlam Feb 06 '14 at 22:54
+1 I am not sure if it is `O(nlogn)`, but it looks like you could get duplicates removed in one pass. I will need to think on how it works now. – brain storm Feb 06 '14 at 22:57
@AlexandruBarbarosie, if we were allowed to use hashtable orother fast lookup in aux storage for the have-we-seen-it-before test, that _would_ reduce complexity, right? – keshlam Feb 06 '14 at 23:03
regarding point 5, "....removeDupes does run entirely in the target array plus a few automatic variables on the stack.". I do not understand what by automatic variables you mean here? – brain storm Feb 06 '14 at 23:03
1

@keshlam yep, then it would have been just one pass. – Alexandru Barbarosie Feb 06 '14 at 23:06
Automatic variables: Local variables. "Automatic" is a jargonish way of saying that they live in the stack frame rather than the heap and their space is reused when the function returns. (In aome languages objects can also be in automatic storage; not Java, at least not before JIT optimization.) – keshlam Feb 06 '14 at 23:09
2

Thanks again for the sanity check, AB -- and for keeping me honest. As may be obvious, I'm primarily an engineer, not a scientist. – keshlam Feb 06 '14 at 23:11
@keshlam although comment of this type are preferred to be avoided on so, you are welcome! – Alexandru Barbarosie Feb 06 '14 at 23:22
I replaced `continue outerloop` with `break;` and removed the `outerloop` label. but now it does not work..I thought it should work in principle – brain storm Feb 06 '14 at 23:27
@user1988876 because `break` breaks the 2nd for loop and you continue with `array[outsize++]=array[consider]` instead of avoiding it as in keshlam's case where continue jumps over that assignment. – Alexandru Barbarosie Feb 06 '14 at 23:29
@keshlam: I reformatted your code in a way I could actually break out of the loop, but that throws arrayIndexOutofBoundException. I have put my code in the EDIT above. would you please have a look for a sec and point me where I am going wrong – brain storm Feb 06 '14 at 23:39
If you want to get rid of `continue outerloop`, you need to have a flag variable which indicates whether you exited the inner loop because you found a match or because you failed to do so, and then you have to test that variable and use that to control whether the value gets added to the output. That wouldn't change the complexity, but would make the code a bit slower... but would also make it easier to read. (As I said in my notes, actually.) – keshlam Feb 06 '14 at 23:40
@AlexandruBarbarosie: I tried the following: the `break` breaks the second for-loop but does not continue with `array[outsize++]=array[consider]`. Here it is: `for (int consider = 1; consider < array.length; ++consider) { for(int compare=0;compare – brain storm Feb 06 '14 at 23:46
1

Unformatted comments are really not the right place to discuss extended chunks of code. If you need help understanding why that doesn't work, you're better off starting a new Question. But _please_ re-read what I've already said: You need to be able to distinguish between the looped exiting because the test found a match and because it didn't, and use that to control whether the assignment occurs. That means you need a flag variable which starts off `false` and becomes `true` only when the break occurs (or vice versa), and you need an `if` statement to test that flag and control the assignment – keshlam Feb 06 '14 at 23:56
... all of which I avoided by using the labelled continue, which not only exits the inner loop but skips the rest of its body and moves immediately to the next iteration of the outer loop. It sounds like you may want to hit a good Java tutorial to understand what labelled continue does; it's not a commonly used language feature, so it isn't surprising you're being confused by it. – keshlam Feb 06 '14 at 23:58
@keshlam: I understood your solution and I was trying to change the continue part to break the inner loop; but now I need to re-think on the lines of setting up a flag and check how it goes – brain storm Feb 07 '14 at 00:08
Added illustration of the alternative to labelled `continue`. Hopefully you'll find that easier to read than the English explanation. – keshlam Feb 07 '14 at 00:15
@keshlam: I was able to follow the logic of using the flag. Thanks for the code fragment. But I found that in your current fragment, you are doing assignment as part of inner loop, so I added an EDIT above. If you think that is correct, you could either leave it as it is or modify yours and delete it. If it is wrong, (which I dont think, as it gives right output), please feel to delete, but let me know in comment here. And please accept Big THANKS! – brain storm Feb 07 '14 at 05:59
Ah. I see what you're getting at. Yeah, scope error, that's what I get for not testing. I actually want a slightly different solution than yours, again for scoping reasons. – keshlam Feb 07 '14 at 14:06

Reverend Gonzo · Answer 2 · 2014-02-06T21:20:45.613

Here one version which doesn't use additional memory (except for the array it returns) and doesn't sort either.

I believe this is slightly worse than O(n*log n).

Edit: I'm wrong. This is slightly better than O(n^3).

public class Dupes {

    private static int[] removeDupes(final int[] array) {
        int end = array.length - 1;
        for (int i = 0; i <= end; i++) {
            for (int j = i + 1; j <= end; j++) {
                if (array[i] == array[j]) {
                    for (int k = j; k < end; k++) {
                        array[k] = array[k + 1];
                    }
                    end--;
                    j--;
                }
            }
        }

        return Arrays.copyOf(array, end + 1);
    }

    public static void main(final String[] args) {
        System.out.println(Arrays.toString(removeDupes(new int[] { 3, 45, 1, 2, 3, 3, 3, 3, 2, 1, 45, 2, 10 })));
        System.out.println(Arrays.toString(removeDupes(new int[] { 2, 2, 3, 3 })));
        System.out.println(Arrays.toString(removeDupes(new int[] { 1, 1, 1, 1, 1, 1, 1, 1 })));
    }
}

and here's a modified version which doesn't shift all of the elements from after the dupe. Instead it simply switches the dupe with the last, non-matching element. This obviously can't guarantee order.

private static int[] removeDupes(final int[] array) {
    int end = array.length - 1;
    for (int i = 0; i <= end; i++) {
        for (int j = i + 1; j <= end; j++) {
            if (array[i] == array[j]) {
                while (end >= j && array[j] == array[end]) {
                    end--;
                }
                if (end > j) {
                    array[j] = array[end];
                    end--;
                }
            }
        }
    }

    return Arrays.copyOf(array, end + 1);
}

It's my impression or the two outer loops are also `O(n^2)` ? You're not cutting in a half in the 2nd `for` as for assuming that `for` will have `O(log n)`. As @alfasin says, the 3 `fors` will be `O(n^3)` — Daniel, Feb 06 '14 at 21:04
The second loop is on average half the size of the first loop since it iterates over the remaining portion of the first loop. The third loop is on average half the size of the second loop, but it also decreases the size of the list. — Reverend Gonzo, Feb 06 '14 at 21:11
Actually, I think mine does better than this. Lemme actually get it onto paper and check... — keshlam, Feb 06 '14 at 21:14

Alexandru Barbarosie · Answer 3 · 2014-02-07T00:08:37.647

Here you have a worst case of O(n^2) where the return points to the first non unique element. So everything before it is unique. Instead of C++ iterators indices in Java can be used.

std::vecotr<int>::iterator unique(std::vector<int>& aVector){
    auto end = aVector.end();
    auto start = aVector.begin();
    while(start != end){
        auto num = *start; // the element to check against
        auto temp = ++start; // start get incremented here
        while (temp != end){
            if (*temp == num){
                std::swap(temp,end);
                end--;
            }
            else
                temp++; // the temp is in else so that if the swap occurs the algo should still check the swapped element.
        }
    }
return end;
}

Java equivalent code: (the return will be an int which is the index of the first not unique element)

int unique(int[] anArray){
        int end = anArray.length-1;
        int start = 0;
        while(start != end){
            int num = anArry[start]; // the element to check against
            int temp = ++start; // start get incremented here
            while (temp != end){
                if (anArry[temp] == num){
                    swap(temp,end); // swaps the values at index of temp and end
                    end--;
                }
                else
                    temp++; // the temp is in else so that if the swap occurs the algo should still check the swapped element.
            }
        }
    return end;
    }

The slight difference in this algo and yours is in your point 2. Where instead of replacing the current element with null you go with swapping it with the last possibly unique element which on the first swap is the last element of array, on second swap the second last and so on.

You might as well consider looking at the std::unique implementation in C++ which is linear in one less than the distance between first and last: Compares each pair of elements, and possibly performs assignments on some of them., but as it was noted by @keshlam it is used on sorted arrays only. The return value is the same as in my algo. Here is the code directly from the standard library:

template<class _FwdIt, class _Pr> inline
    _FwdIt _Unique(_FwdIt _First, _FwdIt _Last, _Pr _Pred)
    {   // remove each satisfying _Pred with previous
    if (_First != _Last)
        for (_FwdIt _Firstb; (_Firstb = _First), ++_First != _Last; )
            if (_Pred(*_Firstb, *_First))
                {   // copy down
                for (; ++_First != _Last; )
                    if (!_Pred(*_Firstb, *_First))
                        *++_Firstb = _Move(*_First);
                return (++_Firstb);
                }
    return (_Last);
    }

sorry, I have zero knowledge of C++. so it is difficult for me to follow. Atleast I see you using pointers(*) which is not allowed in Java. so that makes it tough..or if you could comment above each line, what it is doing, it can help me in understanding — brain storm, Feb 06 '14 at 22:54
Note that `uniq` only removes adjacent duplicates. If you want to remove all duplicates, you need to sort first -- which puts the complete problem as originally stated back up into N-squared space. (Alexandru knows this, I'm just pointing it out for anyone who might be overexcited by the fact that `uniq` itself is linear.) — keshlam, Feb 07 '14 at 00:03
@keshlam you are right, totally forgot about that. It's a 1-1 ) — Alexandru Barbarosie, Feb 07 '14 at 00:05
@keshlam: I tried with this and it worked `int[] arr2 = { 3, 45, 1, 2, 3, 3, 3, 3, 2, 1, 45, 2, 10 };` and the duplicates here are not consecutive — brain storm, Feb 07 '14 at 00:05

epsilonhalbe · Answer 4 · 2014-02-07T00:12:53.800

0

To bring in a bit perspective - one solution in Haskell, it uses lists instead of arrays and returns the reversed order, which can be fixed by applying reverse at the end.

import Data.List (foldl')

removeDup :: (Eq a) => [a] -> [a]
removeDup = foldl' (\acc x-> if x `elem` acc then acc else x:acc) []

edited Feb 07 '14 at 00:12

answered Feb 06 '14 at 23:44

epsilonhalbe

15,637
5
46
74

algorithm removing duplicate elements in array without auxillay storage

4 Answers4

Linked