
Although I have read through the interesting thread "Algorithm: efficient way to remove duplicate integers from an array", I have not been able to find a satisfying answer to my question:

I have a one-dimensional array of doubles, which is usually rather small (containing up to three elements), although for the sake of generality this need not be assumed.

Additionally, I don't only want to find exact duplicates, but also duplicates in the sense that the elements' difference lies below a certain threshold. While this requirement is fairly easy to handle, my actual question is: how can I implement the general duplicate removal in ANSI C with as little overhead as possible?

Remark: The main reasons I have not been able to find my solution from the mentioned thread are threefold:

  1. Many of the given solutions employ languages other than pure C, so they are not of any particular help.
  2. Some of the solutions won't work if all elements in the array are equal, which may well happen in my application.
  3. Some of the described algorithms seem to be intended for integer values only. As I am a C noob, any advice is greatly appreciated.

ADDENDUM: In some kind of pseudo-code, what I would want to achieve is the following:

1) Sort the array in ascending order
2) Loop through the array up to the second-to-last element
   - Check whether the difference between element i and element i+1 is smaller than the threshold
     -> If yes, store their mean value as the next element of the new array (and skip element i+1)
     -> If no, store the original value of element i as the next element of the new array
3) Repeat the whole procedure to check whether the differences between the new array's elements still lie below the threshold
   -> Abort once no difference is smaller than the threshold anymore

Thus, my main problem is: How can I realize step 3 such that an arbitrary number of iterations is possible and the function runs as long as there are array elements that are "too close" (with respect to my threshold).
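The steps above can be sketched in C roughly as follows (a minimal sketch, not a definitive implementation; the names `merge_pass` and `remove_close_dupes` are my own). Each pass merges adjacent within-threshold pairs into their mean; the driver repeats passes until one completes without a merge, which is exactly the stopping condition of step 3. Since the mean of two adjacent sorted elements lies between its neighbours, the array stays sorted after a pass, so sorting once suffices:

```c
#include <math.h>
#include <stdlib.h>

/* qsort comparator for doubles */
static int cmp_double(const void *a, const void *b)
{
    double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

/* One merging pass over a sorted array: adjacent elements closer than
 * thresh are replaced by their mean. Returns the new length and sets
 * *merged if anything was merged. */
static size_t merge_pass(double *x, size_t n, double thresh, int *merged)
{
    size_t i = 0, j = 0;
    *merged = 0;
    while (i < n) {
        if (i + 1 < n && fabs(x[i + 1] - x[i]) < thresh) {
            x[j++] = (x[i] + x[i + 1]) / 2.0;  /* coerce the pair to its mean */
            i += 2;
            *merged = 1;
        } else {
            x[j++] = x[i++];
        }
    }
    return j;
}

/* Step 3: repeat passes until no pair is closer than thresh. */
size_t remove_close_dupes(double *x, size_t n, double thresh)
{
    int merged;
    qsort(x, n, sizeof *x, cmp_double);
    do {
        n = merge_pass(x, n, thresh, &merged);
    } while (merged && n > 1);
    return n;
}
```

Note that the left-to-right pairing is one of several defensible choices: for {4.0, 4.0, 1.0, 17.0, 15.0} with threshold 5.0, sorting gives {1.0, 4.0, 4.0, 15.0, 17.0}, so 1.0 merges with the first 4.0 and the final result is {3.25, 16.0} rather than {2.5, 16.0}. Which pairing is "correct" is precisely the ambiguity discussed in the comments.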

Michael
  • I'll warn you: the threshold condition makes this a difficult problem. For example, if your threshold were .25 and you had three values 1.0, 1.2, and 1.4, there are *multiple* possible "correct" answers (eliminate 1.2 because it is within-threshold of 1.0, eliminate 1.4 for the same reason regarding 1.2, which may not even be there anymore, eliminate *both*, etc.). – WhozCraig Apr 04 '14 at 08:51
  • If the array is guaranteed to be small, a naive two-loop (quadratic) solution is good enough. If you want to coerce adjacent values: sort + divide into coercible chunks + replace each chunk by its mean value. – joop Apr 04 '14 at 09:48
  • @joop This is what I would love to do. However: how do you divide the array into coercible chunks? I have been pondering this problem for days now, but I am still not much closer to a solution. I don't even care about efficiency anymore... – Michael Apr 08 '14 at 11:18
  • @WhozCraig I understand and have added my idea of how to cope with this problem in the addendum of my original question. Unfortunately, I still lack the skills to find a working C solution for this algorithm. – Michael Apr 08 '14 at 11:34
  • 0) Are you allowed to sort the array, or should the original order be maintained? If sorting/altering the array is allowed: 1) sort, 2) find chunk boundaries, 3) aggregate chunks (steps 2 & 3 can be combined). If sorting is not allowed, you'll probably end up with some kind of tree, such as an interval/range tree. – joop Apr 08 '14 at 11:35
  • @joop Sorting is not a problem. Actually, I don't care about the order of the array elements at all. – Michael Apr 08 '14 at 12:04
  • There is an additional question: are the contents of the array dynamically maintained? Should entries be added or removed, even after the sorting and aggregation? And if they are added/removed: how are they addressed, by value only? – joop Apr 08 '14 at 12:07
  • @joop I am not quite sure whether I understand your question. Let's look at an example: Consider the array {4.0,4.0,1.0,17.0,15.0}. The result I would expect is {3.0,16.0} (in ideal case) or {2.5,16.0} (as best approximation if the mean value of close-by elements is taken iteratively and not simultaneously, ie. if it is ((4.0+4.0)/2 + 1.0)/2 instead of (4.0+4.0+1.0)/3). – Michael Apr 08 '14 at 12:21
  • Why would you do that? It is obvious that you'll have to visit/inspect every element to detect the gaps; taking the avg of the "center" element(s) won't buy you any gain. – joop Apr 08 '14 at 12:30
  • But consider the case of WhozCraig above: How would you determine the chunks in an array {1.0,1.2,1.4} if the threshold were 0.25? Would 1.2 belong to 1.0 or 1.4 then? This is why I'd like to solve it iteratively, and first reduce the array to {1.1,1.4}, before checking again and realizing that 1.1 and 1.4 belong to different chunks. Of course, {1.0,1.3} could also be a correct solution, but I guess that's the downside you have to live with... – Michael Apr 08 '14 at 13:04
  • So you want to apply the threshold not between the original elements, but between the newly created centroids? That is indeed a hard problem, comparable to the placement/breaking of words on lines or pages in typesetting. More or less unsolvable (NP-complete?). – joop Apr 08 '14 at 13:11
  • First between the original elements, then again between the newly created centroids - until there are no elements left where abs(a-b) < threshold. – Michael Apr 08 '14 at 13:21

3 Answers


This problem is a variation of the element distinctness problem.

Since you are looking not only for 'exact duplicates' but also for 'close duplicates', the solution cannot be based on hashing.

The solution is basically: sort the array, then iterate over it, 'skipping' elements that are dupes.

This solution is O(n log n), and it is optimal, since that matches the lower bound for comparison-based element distinctness.

C-like Pseudo code:

#include <float.h>  /* DBL_MAX */
#include <math.h>   /* fabs */
#include <stdlib.h> /* qsort */

#define epsilon SOME_SMALL_TOLERANCE_VALUE

static int cmpDouble(const void *a, const void *b)
{
   double da = *(const double *)a, db = *(const double *)b;
   return (da > db) - (da < db);
}

int trimDupes(double arr[], int n)
{
   int i;
   int currPos = 0;
   double last = -DBL_MAX; /* stands in for negative infinity */
   qsort(arr, n, sizeof arr[0], cmpDouble);
   for (i = 0; i < n; i++) {
      if (fabs(last - arr[i]) > epsilon) {
          arr[currPos++] = arr[i];
          last = arr[i]; /* moving this out of the condition gives slightly different behavior; think about what you need */
       }
    }
    return currPos; /* new length of the array - after it everything is garbage */
}

This solution uses very little extra space [basically whatever space is required by the sorting algorithm + a few constants], and O(n log n) time for sorting + an extra single iteration.

amit
  • The solution can contain hashing, just not directly: first, round the hash key to a multiple of epsilon, and second, check the neighbours (x+ε, x−ε). – Vovanium Apr 04 '14 at 12:33
  • @amit Unfortunately, an implementation of this code has not worked for me. Maybe I mixed something up with currPos, curr and/or arr, since I did not really understand the last comment and the return argument. – Michael Apr 08 '14 at 11:20
  • @Michael What are you trying, and why is it failing? The general approach should work to get rid of near-dupes, and will keep only the first element of each group of dupes that lie below the desired difference threshold. – amit Apr 08 '14 at 11:27
  • @amit I was trying to put my source here, but it won't work in the comment box. Is there any other way to show what I've done? – Michael Apr 08 '14 at 12:22
  • @Michael I created a [chat room](http://chat.stackoverflow.com/rooms/50506/how-to-efficiently-eliminate-duplicate-elements-in-double-array-in-c). Hope it will work. I will sample it later on; write whatever you have and I will read it all and respond when I have the time (today). – amit Apr 08 '14 at 12:42
  • It was all my fault; had the condition wrong. amit's code works like a charm, in a sense that only the smaller value of two values falling within the threshold is kept. It should therefore be a good basis for the extension to an iterative procedure and a replacement of both values by their mean. – Michael Apr 09 '14 at 06:30

Sort the array. Then loop over it, copying to another array. If the next item is within the threshold of the current item, use an inner loop to compare the current item with all remaining items, skipping every one within the threshold. When you reach an item outside the threshold, that becomes the next current item.

By fixing the order in which elements are compared (i.e. by sorting first), you sidestep the problem outlined in the comments to your question. But do note that the results will differ if you change the order (sorting ascending versus sorting descending).
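One way this description might look in C (a sketch under my own assumptions: the name `skip_close` and the separate output array are my choices, not the answerer's). Each kept element anchors an inner loop that skips all following items within the threshold of it:

```c
#include <math.h>
#include <stdlib.h>

/* Comparator for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

/* Sorts x ascending, then keeps the first element of each
 * within-threshold run in out, skipping the rest of the run.
 * out must have room for at least n doubles; returns the kept count. */
size_t skip_close(double *x, size_t n, double thresh, double *out)
{
    size_t i = 0, j = 0;
    qsort(x, n, sizeof *x, cmp_double);
    while (i < n) {
        double cur = x[i];
        out[j++] = cur;           /* cur is the next kept item */
        do {                      /* inner loop: skip near-duplicates of cur */
            i++;
        } while (i < n && fabs(x[i] - cur) < thresh);
    }
    return j;
}
```

Unlike the mean-taking variant from the question's addendum, this keeps an original array value (the smallest of each run, given ascending sort), so a single pass already terminates: every kept pair is at least the threshold apart.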

Some programmer dude

I have now found a solution that works for me so far, although several function calls are needed and the complexity might not be optimal:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int compareDouble (const void * a, const void * b)
{
  if ( *(const double*)a < *(const double*)b ) return -1;
  if ( *(const double*)a > *(const double*)b ) return 1;
  return 0;
}

int main(void)
{
  double x[6] = {1.0,4.0,17.0,4.0,17.0,17.0};
  size_t n = sizeof(x)/sizeof(x[0]);
  const double thresh = 5.0;

  qsort(x, n, sizeof(double), compareDouble);

  size_t i = 0;
  size_t j = 0;

  while(i<=n-1)
  {
    if(i==n-1)
    {
      x[j++] = x[i];
      break;
    }
    else if(fabs(x[i]-x[i+1])>thresh)
    {
      x[j++] = x[i++];
    }
    else
    {
      x[j++] = (x[i]+x[i+1])/2;
      i+=2;
    }
  }

  for(i=0; i<j; i++)
  {
    printf("result[%zu] = %.2f\n", i, x[i]);
  }

  return 0;
}

Any additional comments or remarks appreciated!

Michael
  • `void main()` is, probably, allowed in your implementation as an extension. I suggest you use `int main(void)` which is guaranteed by the Standard to work in all implementations. – pmg Apr 10 '14 at 08:10