JonasH has posted the actual answer about how this works. But I was interested to see how much faster (if at all) a binary search would be compared to using `Distinct()` or a linear search.
Here's the code I used to benchmark it:
```csharp
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

namespace ConsoleApp1;

public class UnderTest
{
    public UnderTest()
    {
        const int NUM_VALUES = 1_000_000;
        const double PROBABILITY_OF_CHANGE = 0.00001;

        // Build a sorted array in which each element has a small chance of
        // being one greater than its predecessor.
        _data = new int[NUM_VALUES];
        var rng = new Random(98891); // Fixed seed, for reproducibility.

        for (int value = 0, i = 0; i < NUM_VALUES; ++i)
        {
            _data[i] = value;

            if (rng.NextDouble() <= PROBABILITY_OF_CHANGE)
                ++value;
        }

        // Print out to prove they all return the same value.
        Console.WriteLine(usingBinarySearch(_data));
        Console.WriteLine(usingDistinct    (_data));
        Console.WriteLine(usingLinearSearch(_data));
    }

    [Benchmark]
    public void UsingBinarySearch()
    {
        usingBinarySearch(_data);
    }

    [Benchmark]
    public void UsingDistinct()
    {
        usingDistinct(_data);
    }

    [Benchmark]
    public void UsingLinearSearch()
    {
        usingLinearSearch(_data);
    }

    // Counts the distinct values by hopping from the start of one run of
    // equal values to the start of the next via a binary search.
    static int usingBinarySearch(int[] a)
    {
        int i = 0;
        int count = 0;

        while (i < a.Length)
        {
            i = nextIndex(a, i, a[i]);
            count++;
        }

        return count;
    }

    // Returns the index of the first element at or after a[l] that differs
    // from target, i.e. the upper bound of the current run of equal values.
    static int nextIndex(int[] a, int l, int target)
    {
        int r = a.Length - 1;

        while (l <= r)
        {
            int mid = l + (r - l) / 2;

            if (a[mid] == target) l = mid + 1;
            else                  r = mid - 1;
        }

        return r + 1;
    }

    static int usingDistinct(int[] a)
    {
        return a.Distinct().Count();
    }

    // Counts the distinct values with a single pass, comparing each element
    // with its predecessor.
    static int usingLinearSearch(int[] a)
    {
        int count = 1;

        for (int i = 1; i < a.Length; i++)
        {
            count += a[i - 1] != a[i] ? 1 : 0;
        }

        return count;
    }

    readonly int[] _data;
}
```
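For completeness: to run this you need the usual BenchmarkDotNet entry point, something like the standard runner below (scaffolding only; it isn't part of the measurements):

```csharp
using BenchmarkDotNet.Running;

namespace ConsoleApp1;

public class Program
{
    public static void Main()
    {
        // Discovers and runs all [Benchmark] methods in UnderTest,
        // then prints the summary tables shown below.
        BenchmarkRunner.Run<UnderTest>();
    }
}
```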
For the first test run, I gave it some data where I'd expect the binary search to be significantly faster: an array of 1M ints where the probability of each element's value being larger than the previous one was 0.00001 (`PROBABILITY_OF_CHANGE = 0.00001`).
Using a random number generator with a fixed seed of 98891, this resulted in there being only 7 distinct values in the array, yielding the following timing results (using .NET 6.0):
| Method | Mean | Error | StdDev |
|------------------ |---------------:|--------------:|--------------:|
| UsingBinarySearch | 251.0 ns | 4.93 ns | 12.64 ns |
| UsingDistinct | 9,341,067.8 ns | 185,888.76 ns | 294,839.27 ns |
| UsingLinearSearch | 1,607,222.2 ns | 51,565.50 ns | 146,282.75 ns |
As you might expect, the binary search is way faster for this case. Of note is that `Distinct()` is very slow compared to the linear search.
This isn't really a fair comparison, because `Distinct()` works with unsorted data while the other two algorithms require sorted data, so bear that in mind. If you have unsorted data, then the overhead of sorting it for the other algorithms will make `Distinct()` a better choice. I leave such comparisons as the proverbial exercise for the reader...
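(If you do want to try it, a minimal sketch of a sort-first variant is the hypothetical helper below, which is not part of the benchmarks above: copy and sort the input, then reuse the linear scan. The clone and sort add an O(n log n) cost that `Distinct()` doesn't pay.)

```csharp
// Hypothetical variant for unsorted input: pays the sorting cost up front,
// then counts with the same single-pass linear scan as before.
static int usingSortThenLinearSearch(int[] a)
{
    var sorted = (int[])a.Clone(); // Don't mutate the caller's data.
    Array.Sort(sorted);            // O(n log n) overhead that Distinct() avoids.
    return usingLinearSearch(sorted);
}
```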
Now let's try with `PROBABILITY_OF_CHANGE = 0.001` (resulting in 919 distinct elements):
| Method | Mean | Error | StdDev |
|------------------ |------------:|-----------:|-----------:|
| UsingBinarySearch | 93.08 us | 1.787 us | 4.831 us |
| UsingDistinct | 9,944.12 us | 197.825 us | 356.718 us |
| UsingLinearSearch | 1,503.85 us | 28.239 us | 63.740 us |
Binary search is still significantly faster.
Now with `PROBABILITY_OF_CHANGE = 0.1` (resulting in 100,058 distinct elements):
| Method | Mean | Error | StdDev |
|------------------ |----------:|----------:|----------:|
| UsingBinarySearch | 5.541 ms | 0.1096 ms | 0.1347 ms |
| UsingDistinct | 14.331 ms | 0.5516 ms | 1.6091 ms |
| UsingLinearSearch | 2.319 ms | 0.0422 ms | 0.0782 ms |
Now the linear search is faster than the binary search.
This goes to show that `Distinct()` is not a good way to solve this - a linear search is always better (primarily because the data is already sorted; if it wasn't, we'd have to sort it, which would change the timings significantly for the other algorithms).
Using a binary search is only worth it if the number of distinct values is relatively low compared to the size of the container: it does roughly k * log2(n) comparisons for k distinct values among n elements, so with n = 1,000,000 (log2(n) ≈ 20) it wins comfortably for k = 7 or k = 919, but loses to the single O(n) pass once k approaches 100,000.
(Do note the different units output by BenchmarkDotNet for these runs - ns, us and ms. I realised afterwards that I forgot to specify the units, so they were automatically scaled to suit the fastest benchmark in each run...)
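If you want to pin the units rather than have them auto-scaled, BenchmarkDotNet supports this via a custom config. Something like the sketch below should do it - this is based on my reading of the config API rather than a re-run of the benchmarks:

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Reports;
using Perfolizer.Horology;

// Forces summary tables to report in microseconds instead of auto-scaling
// to suit the fastest benchmark in the run.
public class MicrosecondConfig : ManualConfig
{
    public MicrosecondConfig()
    {
        WithSummaryStyle(SummaryStyle.Default.WithTimeUnit(TimeUnit.Microsecond));
    }
}

// Then apply it to the benchmark class:
// [Config(typeof(MicrosecondConfig))]
// public class UnderTest { ... }
```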