Fastest algorithm to find frequencies of each element of an array of reals?

Question

The problem is to find frequencies of each element of an array of reals.

double[] a = new double[n]
int[] freq = new int[n]

I have come up with two solution:

First solution O(n^2):

for (int i = 0; i < a.length; i++) {
  if (freq[i] != -1) {
    for (int j = i + 1; j < a.length; j++) {
      if (a[i] == a[j]) {
        freq[i]++;
        freq[j] = -1;
      }
    }
  }
}

Second solution O(nlogn):

quickSort(a, 0, a.length - 1);

freq[j] = 1;
for (int i = 0; i < a.length - 1; i++) {
  if (a[i] == a[i + 1]) {
    freq[j]++;
  }
  else {
    j = i + 1;
    freq[j] = 1;
  }
}

Is there any faster algorithm for this problem (O(n) maybe)? Thank you in advance for any help you can provide.

Checking for identity of `double`s is not a good practice. [What every programmer should know about floating points](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) — amit, Mar 23 '16 at 09:20
As a side note, this is element distinctness problem, and there is no O(n) solution under algebraic tree model. If you are going to stick with identity of doubles though, you can use a hash table, but again - that's a bad practice. — amit, Mar 23 '16 at 09:22
@amit why is it bad practice to use hash tables in cases like above? — sAm, Mar 23 '16 at 09:25
You could sort the array **O(n log n)** and than read through the array to detect how often each (allow a little difference epsilon while comparing real values) element occurs (they are now sequence) **O(n)** That makes over all time of **O(n log n)**. — MrSmith42, Mar 23 '16 at 09:28
I'd say that the problem (i.e. the OP) is **ill-posed**, since it relies of floating point equality. A similar but well-posed problem is to find the frequency of doubles in a histogram of bins of sufficient size `h`. Sufficient here means that the floating-point representations of `x` and `x+h` are ordered. `h` can, of course, depend on `x`: best is presumably to use bins of equal size in `log(|x|)`. — Walter, Mar 23 '16 at 09:59

score 10 · Answer 1 · edited May 23 '17 at 11:45

Let me start by saying that checking for identity of doubles is not a good practice. For more details see: What every programmer should know about floating points.
You should use more robust double comparisons.

Now, that we are done with that, let's face your problem.
You are dealing with variation of Element Distinctness Problem with floating points number.

Generally speaking, under the algebraic tree computation model, one cannot do it better than Omega(nlogn) (references in this thread: https://stackoverflow.com/a/7055544/572670).

However, if you are going to stick with the doubles identity checks (please don't), you can use a stronger model and hash table to achieve O(n) solution, by maintaining a hash-table based histogram (implemented as HashMap<Double,Integer>) of the elements, and when you are done, scan the histogram and yield the key of the highest value.
(Please don't do it)

There is a complex way to do achieve O(n) time based on hashing, even when dealing with floating points. This is based on adding elements to multiple entries of the hash table and assuming a hash function is taking a range of elements [x-delta/2,x+delta/2) to the same hash value (so it is hashing in chunks [x1,x2)->h1, [x2,x3)->h2, [x3,x4)->h3, ....) . You then can create a hash table where an element x will be hashed to 3 values: x-3/4delta, x, x + 3/4delta.
This guarantees when checking an equal value later, it will have a match in at least one of the 3 places you put the element.

This is significantly more complex to implement, but it should work. A variant of this can be found in cracking the code interview, mathematical, question 6. (Just make sure you look at edition 5, the answer in edition 4 is wrong and was fixed in the newer edition)

As another side note, you don't need to implement your own sort. Use Arrays.sort()

score 1 · Accepted Answer · answered Mar 23 '16 at 09:32

Your doubles have already been rounded appropriately and you are confident there isn't an error to worry about, you can use a hash map like

Map<Double, Long> freqCount = DoubleStream.of(reals).boxed()
        .collect(Collectors.groupingBy(d -> d, Collectors.counting()));

This uses quite a bit of memory, but is O(n).

The alternative is to use the following as a first pass

NavigableMap<Double, Long> freqCount =  DoubleStream.of(reals).boxed()
        .collect(Collectors.groupingBy(d -> d, TreeMap::new, Collectors.counting()));

This will count all the values which are exactly the same, and you can use a grouping strategy to combine double values which are almost the same, but should be considered equal for your purposes. This is O(N log N)

libby · Answer 3 · 2016-03-23T09:44:17.400

0

Using a Trie would perform in pretty much linear time because insertions are going to be extremely fast (or as fast as the order of your real number).

Sorting and counting is definitely way too slow if all you need is the frequencies. Your friend is the trie: https://en.wikipedia.org/wiki/Trie

If you were using a Trie, then you would convert each integer into a String (simple enough in Java). The complexity of an insertion into a Trie varies slightly based on the implementation, but in general it will be proportional to the length of the String.

If you need an implementation of a Trie, I suggest looking into Robert Sedgwick's implementation for his Algorithm's course here:

http://algs4.cs.princeton.edu/52trie/TrieST.java.html

edited Mar 23 '16 at 09:44

answered Mar 23 '16 at 09:31

libby

585
4
15

Creating the binary tree would take O(N log N) – Peter Lawrey Mar 23 '16 at 09:32
True, edited out. I think Trie is as close to as linear as you're going to get – libby Mar 23 '16 at 09:40
Unless you use a hash map. – Peter Lawrey Mar 23 '16 at 10:01

Fastest algorithm to find frequencies of each element of an array of reals?

3 Answers3