JonasH has posted the actual answer about how this works. But I was interested to see how much faster (if at all) a binary search would be compared to using `Distinct()` or a linear search.
Here's the code I used to benchmark it:
```csharp
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

namespace ConsoleApp1;

public class UnderTest
{
    public UnderTest()
    {
        const int NUM_VALUES = 1_000_000;
        const double PROBABILITY_OF_CHANGE = 0.00001;

        // Build a sorted array in which each element has a small chance of
        // being one greater than its predecessor.
        _data = new int[NUM_VALUES];
        var rng = new Random(98891); // Fixed seed, for reproducibility.

        for (int value = 0, i = 0; i < NUM_VALUES; ++i)
        {
            _data[i] = value;

            if (rng.NextDouble() <= PROBABILITY_OF_CHANGE)
                ++value;
        }

        // Print out to prove they all return the same value.
        Console.WriteLine(usingBinarySearch(_data));
        Console.WriteLine(usingDistinct    (_data));
        Console.WriteLine(usingLinearSearch(_data));
    }

    [Benchmark]
    public void UsingBinarySearch()
    {
        usingBinarySearch(_data);
    }

    [Benchmark]
    public void UsingDistinct()
    {
        usingDistinct(_data);
    }

    [Benchmark]
    public void UsingLinearSearch()
    {
        usingLinearSearch(_data);
    }

    // Counts the distinct values by hopping from the start of one run of
    // equal values to the start of the next via a binary search.
    static int usingBinarySearch(int[] a)
    {
        int i = 0;
        int count = 0;

        while (i < a.Length)
        {
            i = nextIndex(a, i, a[i]);
            count++;
        }

        return count;
    }

    // Returns the index of the first element at or after a[l] that differs
    // from target, i.e. the upper bound of the current run of equal values.
    static int nextIndex(int[] a, int l, int target)
    {
        int r = a.Length - 1;

        while (l <= r)
        {
            int mid = l + (r - l) / 2;

            if (a[mid] == target) l = mid + 1;
            else                  r = mid - 1;
        }

        return r + 1;
    }

    static int usingDistinct(int[] a)
    {
        return a.Distinct().Count();
    }

    // Counts the distinct values with a single pass, comparing each element
    // with its predecessor.
    static int usingLinearSearch(int[] a)
    {
        int count = 1;

        for (int i = 1; i < a.Length; i++)
        {
            count += a[i - 1] != a[i] ? 1 : 0;
        }

        return count;
    }

    readonly int[] _data;
}
```
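For completeness: to run this you need the usual BenchmarkDotNet entry point, something like the standard runner below (scaffolding only; it isn't part of the measurements):

```csharp
using BenchmarkDotNet.Running;

namespace ConsoleApp1;

public class Program
{
    public static void Main()
    {
        // Discovers and runs all [Benchmark] methods in UnderTest,
        // then prints the summary tables shown below.
        BenchmarkRunner.Run<UnderTest>();
    }
}
```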
For the first test run, I gave it some data where I'd expect the binary search to be significantly faster: an array of 1M ints where the probability of each element's value being larger than the previous one was 0.00001 (`PROBABILITY_OF_CHANGE = 0.00001`).
Using a random number generator with a fixed seed of 98891, this resulted in there being only 7 distinct values in the array, yielding the following timing results (using .NET 6.0):
| Method | Mean | Error | StdDev |
|------------------ |---------------:|--------------:|--------------:|
| UsingBinarySearch | 251.0 ns | 4.93 ns | 12.64 ns |
| UsingDistinct | 9,341,067.8 ns | 185,888.76 ns | 294,839.27 ns |
| UsingLinearSearch | 1,607,222.2 ns | 51,565.50 ns | 146,282.75 ns |
As you might expect, the binary search is way faster for this case. Of note is that `Distinct()` is very slow compared to the linear search.
This isn't really a fair comparison, because `Distinct()` works with unsorted data while the other two algorithms require sorted data, so bear that in mind. If you have unsorted data, then the overhead of sorting it for the other algorithms will make `Distinct()` a better choice. I leave such comparisons as the proverbial exercise for the reader...
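(If you do want to try it, a minimal sketch of a sort-first variant is the hypothetical helper below, which is not part of the benchmarks above: copy and sort the input, then reuse the linear scan. The clone and sort add an O(n log n) cost that `Distinct()` doesn't pay.)

```csharp
// Hypothetical variant for unsorted input: pays the sorting cost up front,
// then counts with the same single-pass linear scan as before.
static int usingSortThenLinearSearch(int[] a)
{
    var sorted = (int[])a.Clone(); // Don't mutate the caller's data.
    Array.Sort(sorted);            // O(n log n) overhead that Distinct() avoids.
    return usingLinearSearch(sorted);
}
```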
Now let's try with `PROBABILITY_OF_CHANGE = 0.001` (resulting in 919 distinct elements):
| Method | Mean | Error | StdDev |
|------------------ |------------:|-----------:|-----------:|
| UsingBinarySearch | 93.08 us | 1.787 us | 4.831 us |
| UsingDistinct | 9,944.12 us | 197.825 us | 356.718 us |
| UsingLinearSearch | 1,503.85 us | 28.239 us | 63.740 us |
Binary search is still significantly faster.
Now with `PROBABILITY_OF_CHANGE = 0.1` (resulting in 100,058 distinct elements):
| Method | Mean | Error | StdDev |
|------------------ |----------:|----------:|----------:|
| UsingBinarySearch | 5.541 ms | 0.1096 ms | 0.1347 ms |
| UsingDistinct | 14.331 ms | 0.5516 ms | 1.6091 ms |
| UsingLinearSearch | 2.319 ms | 0.0422 ms | 0.0782 ms |
Now the linear search is faster than the binary search.
This goes to show that `Distinct()` is not a good way to solve this - a linear search is always better (primarily because the data is already sorted; if it wasn't, we'd have to sort it, which would change the timings significantly for the other algorithms).
Using a binary search is only worth it if the number of distinct values is relatively low compared to the size of the container: it does roughly k * log2(n) comparisons for k distinct values among n elements, so with n = 1,000,000 (log2(n) ≈ 20) it wins comfortably for k = 7 or k = 919, but loses to the single O(n) pass once k approaches 100,000.
(Do note the different units output by BenchmarkDotNet for these runs - ns, us and ms. I realised afterwards that I forgot to specify the units, so they were automatically scaled to suit the fastest benchmark in each run...)
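If you want to pin the units rather than have them auto-scaled, BenchmarkDotNet supports this via a custom config. Something like the sketch below should do it - this is based on my reading of the config API rather than a re-run of the benchmarks:

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Reports;
using Perfolizer.Horology;

// Forces summary tables to report in microseconds instead of auto-scaling
// to suit the fastest benchmark in the run.
public class MicrosecondConfig : ManualConfig
{
    public MicrosecondConfig()
    {
        WithSummaryStyle(SummaryStyle.Default.WithTimeUnit(TimeUnit.Microsecond));
    }
}

// Then apply it to the benchmark class:
// [Config(typeof(MicrosecondConfig))]
// public class UnderTest { ... }
```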