Most efficient way to find all modes in a List of randomly generated integers and how often they occured

Question

If a method written in C# will be passed either a null or somewhere between 0 to 6,000,000 randomly generated and unsorted integers, what is the most efficient way to determine all modes and how many times they occurred? In particular, can anyone help me with a LINQ based solution, which I'm struggling with?

Here is what I have so far:

My closest LINQ solution so far only grabs the first mode it finds and does not specify the number of occurrences. It is also about 7 times as slow on my computer as my ugly, bulky implementation, which is hideous.

    int mode = numbers.GroupBy(number => number).OrderByDescending(group => group.Count()).Select(k => k.Key).FirstOrDefault();

My manually coded method.

    public class NumberCount
    {
        public int Value;
        public int Occurrences;

        public NumberCount(int value, int occurrences)
        {
            Value = value;
            Occurrences = occurrences;
        }
    }

    private static List<NumberCount> findMostCommon(List<int> integers)
    {
        if (integers == null)
            return null;
        else if (integers.Count < 1)
            return new List<NumberCount>();

        List<NumberCount> mostCommon = new List<NumberCount>();

        integers.Sort();

        mostCommon.Add(new NumberCount(integers[0], 1));
        for (int i=1; i<integers.Count; i++)
        {
            if (mostCommon[mostCommon.Count - 1].Value != integers[i])
                mostCommon.Add(new NumberCount(integers[i], 1));
            else
                mostCommon[mostCommon.Count - 1].Occurrences++;
        }

        List<NumberCount> answer = new List<NumberCount>();
        answer.Add(mostCommon[0]);
        for (int i=1; i<mostCommon.Count; i++) 
        {
            if (mostCommon[i].Occurrences > answer[0].Occurrences)
            {
                if (answer.Count == 1)
                {
                    answer[0] = mostCommon[i];
                }
                else
                {
                    answer = new List<NumberCount>();
                    answer.Add(mostCommon[i]);
                }
            }
            else if (mostCommon[i].Occurrences == answer[0].Occurrences)
            {
                answer.Add(mostCommon[i]);
            }
        }

        return answer;        
    }

Basically, I'm trying to get an elegant, compact LINQ solution at least as fast as my ugly method. Thanks in advance for any suggestions.

Why do you think linq would do this any better or less ugly? Your code seems pretty straight-forward. Apart from this I still don´t get what you actually want to achieve. What is a mode in your case? How to you create it? — MakePeaceGreatAgain, Aug 10 '18 at 13:41
LINQ accomplishes most of it in one line. However, I'm stuck at the last bits. The mode is the number or numbers that occurred the most often. — Random User, Aug 10 '18 at 13:50
@HimBromBeere: I believe in this case "mode" = "number that occurs the maximal number of times". So in the array { 1, 3, 2, 4, 1, 4, 5 } the modes are 4 and 1 as they both occur twice, and nothing occurs more than twice. — Jon Skeet, Aug 10 '18 at 13:50
Are your random numbers within a given range? Or do you know there will be at most a small amount of distinct values? If there are only ten values for example you can create a simple collection of counts and then loop over this collection to see what is the biggest one (which is nice and quick because you only have ten things to sort/compare). If you potentially have 5,000,000 different integers this method becomes a lot less efficient... — Chris, Aug 10 '18 at 13:54
The only limitation is that they are valid non-negative 32 bit integers. — Random User, Aug 10 '18 at 13:57
I strongly suspect then that your method is going to be the way to go rather than using linq. A few points though - you waste time with `integers.Sort();`. The following code doesn't seem to require the list to be sorted. Also you might as well skip `mostCommon.Add(new NumberCount(integers[0], 1));` and start your loop at `i=0`. lastly rather than `answer = new List();` you probably can just do `answer.Clear()` to save creating multiple lists. — Chris, Aug 10 '18 at 14:07
@Chris The sort is needed for this test: `mostCommon[mostCommon.Count - 1].Value != integers[i]` so that you capture the run. — NetMage, Aug 15 '18 at 01:08
@NetMage: Oh yes. I had misread and assumed that it was a dictionary being used to store the counts rather than a list and that it would just increment the appropriate item each time. Add that to my previous comment then. :) — Chris, Aug 15 '18 at 13:42

Franck · Answer 1 · 2018-08-10T13:59:58.663

I would personally use a ConcurrentDictionary that would update a counter and dictionary are faster to access. I use this method quite a lot and it's more readable.

  // create a dictionary
  var dictionary = new ConcurrentDictionary<int, int>();

  // list of you integers
  var numbers = new List<int>();

  // parallel the iteration ( we can because concurrent dictionary is thread safe-ish
  numbers.AsParallel().ForAll((number) =>
  {
      // add the key if it's not there with value of 1 and if it's there it use the lambda function to increment by 1
      dictionary.AddOrUpdate(number, 1, (key, old) => old + 1);
  });

Then it's only a matter of getting the most occurrence there is many ways. I don't fully understand your version but the single most is only a matter of 1 aggregate like so :

var topMostOccurence = dictionary.Aggregate((x, y) => { return x.Value > y.Value ? x : y; });

Dongdong · Answer 2 · 2018-08-13T02:28:03.813

what you want: 2+ numbers could appear same times in an array, like: {1,1,1,2,2,2,3,3,3}

your current code is from here: Find the most occurring number in a List<int> but it returns a number only, it's exactly a wrong result.

The problem of Linq is: loop cannot end if you don't want it continue.

But, here I result a list with LINQ as you required:

List<NumberCount> MaxOccurrences(List<int> integers)
{
    return integers?.AsParallel()
        .GroupBy(x => x)//group numbers, key is number, count is count
        .Select(k => new NumberCount(k.Key, k.Count()))
        .GroupBy(x => x.Occurrences)//group by Occurrences, key is Occurrences, value is result
        .OrderByDescending(x => x.Key) //sort
        .FirstOrDefault()? //the first one is result
        .ToList();
}

Test details:

Array Size:30000

30000
MaxOccurrences only
MaxOccurrences1: 207
MaxOccurrences2: 38 
=============
Full List
Original1: 28
Original2: 23
ConcurrentDictionary1: 32
ConcurrentDictionary2: 34
AsParallel1: 27
AsParallel2: 19
AsParallel3: 36

ArraySize: 3000000

3000000
MaxOccurrences only
MaxOccurrences1: 3009
MaxOccurrences2: 1962 //<==this is the best one in big loop.
=============
Full List
Original1: 3200
Original2: 3234
ConcurrentDictionary1: 3391
ConcurrentDictionary2: 2681
AsParallel1: 3776
AsParallel2: 2389
AsParallel3: 2155

Here is code:

class Program
{
    static void Main(string[] args)
    {
        const int listSize = 3000000;
        var rnd = new Random();
        var randomList = Enumerable.Range(1, listSize).OrderBy(e => rnd.Next()).ToList();

        // the code that you want to measure comes here

        Console.WriteLine(randomList.Count);
        Console.WriteLine("MaxOccurrences only");

        Test(randomList, MaxOccurrences1);
        Test(randomList, MaxOccurrences2);


        Console.WriteLine("=============");
        Console.WriteLine("Full List");
        Test(randomList, Original1);
        Test(randomList, Original2);
        Test(randomList, AsParallel1);
        Test(randomList, AsParallel2);
        Test(randomList, AsParallel3);

        Console.ReadLine();
    }

    private static void Test(List<int> data, Action<List<int>> method)
    {
        var watch = System.Diagnostics.Stopwatch.StartNew();
        method(data);
        watch.Stop();
        Console.WriteLine($"{method.Method.Name}: {watch.ElapsedMilliseconds}");
    }
    private static void Original1(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .OrderByDescending(group => group.Count())
            .Select(k => new NumberCount(k.Key, k.Count()))
            .ToList();
    }

    private static void Original2(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .Select(k => new NumberCount(k.Key, k.Count()))
            .OrderByDescending(x => x.Occurrences)
            .ToList();
    }

    private static void AsParallel1(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .AsParallel() //each group will be count by a CPU unit
            .Select(k => new NumberCount(k.Key, k.Count())) //Grap result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }

    private static void AsParallel2(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .Select(k => new
            {
                Key = k.Key,
                Occurrences = k.Count()
            }) //Grap result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }

    private static void AsParallel3(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .Select(k => new NumberCount(k.Key, k.Count())) //Grap result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }


    private static void MaxOccurrences1(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .GroupBy(x => x.Count())
            .OrderByDescending(x => x.Key)
            .FirstOrDefault()?
            .ToList()
            .Select(k => new NumberCount(k.Key, k.Count()))
            .ToList();
    }

    private static void MaxOccurrences2(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(x => x)//group numbers, key is number, count is count
            .Select(k => new NumberCount(k.Key, k.Count()))
            .GroupBy(x => x.Occurrences)//group by Occurrences, key is Occurrences, value is result
            .OrderByDescending(x => x.Key) //sort
            .FirstOrDefault()? //the first one is result
            .ToList();
    }
    private static void ConcurrentDictionary1(List<int> integers)
    {
        ConcurrentDictionary<int, int> result = new ConcurrentDictionary<int, int>();

        integers?.ForEach(x => { result.AddOrUpdate(x, 1, (key, old) => old + 1); });

        result.OrderByDescending(x => x.Value).ToList();
    }
    private static void ConcurrentDictionary2(List<int> integers)
    {
        ConcurrentDictionary<int, int> result = new ConcurrentDictionary<int, int>();

        integers?.AsParallel().ForAll(x => { result.AddOrUpdate(x, 1, (key, old) => old + 1); });

        result.OrderByDescending(x => x.Value).ToList();
    }

}
public class NumberCount
{
    public int Value;
    public int Occurrences;

    public NumberCount(int value, int occurrences)
    {
        Value = value;
        Occurrences = occurrences;
    }
}

For example, {1,1,2,2,2,3,3,3} should return 2 and 3 and specify 3 occurrences of each. — Random User, Aug 10 '18 at 15:43

Evertude · Answer 3 · 2018-08-10T14:08:55.320

I tested with the code below on my Intel i7-8700K and achieved the following results:

Lambda: found 78 in 134 ms.

Manual: found 78 in 368 ms.

Dictionary: found 78 in 195 ms.

    static IEnumerable<int> GenerateNumbers(int amount)
    {
        Random r = new Random();
        for (int i = 0; i < amount; i++)
            yield return r.Next(100);
    }

    static void Main(string[] args)
    {
        var numbers = GenerateNumbers(6_000_000).ToList();

        Stopwatch sw = Stopwatch.StartNew();
        int mode = numbers.GroupBy(number => number).OrderByDescending(group => group.Count()).Select(k =>
        {
            int count = k.Count();
            return new { Mode = k.Key, Count = count };
        }).FirstOrDefault().Mode;
        sw.Stop();
        Console.WriteLine($"Lambda: found {mode} in {sw.ElapsedMilliseconds} ms.");


        sw = Stopwatch.StartNew();
        mode = findMostCommon(numbers)[0].Value;
        sw.Stop();
        Console.WriteLine($"Manual: found {mode} in {sw.ElapsedMilliseconds} ms.");

        // create a dictionary
        var dictionary = new ConcurrentDictionary<int, int>();

        sw = Stopwatch.StartNew();
        // parallel the iteration ( we can because concurrent dictionary is thread safe-ish
        numbers.AsParallel().ForAll((number) =>
        {
            // add the key if it's not there with value of 1 and if it's there it use the lambda function to increment by 1
            dictionary.AddOrUpdate(number, 1, (key, old) => old + 1);
        });
        mode = dictionary.Aggregate((x, y) => { return x.Value > y.Value ? x : y; }).Key;
        sw.Stop();
        Console.WriteLine($"Dictionary: found {mode} in {sw.ElapsedMilliseconds} ms.");


        Console.ReadLine();
    }

The Lambda and the Dictionary are only returning one mode and not specifying the numbers and how many times they occurred. — Random User, Aug 10 '18 at 15:09
Also, the Dictionary is operating on a sorted List because the same List is being used throughout. Additionally, your limiting the values to 100. When int.MaxValue is used, My results went from: max 100 and re-using numbers even though they are sorted in between : Lambda=173 Manual=477 Dictionary=210 max is int.MaxValue and each process gets a fresh array : Lambda=7276 Manual=1431 Dictionary=6381 — Random User, Aug 10 '18 at 15:24

NetMage · Accepted Answer · 2019-06-21T19:07:07.237

Different code is more efficient for differing lengths, but as the length approaches 6 million, this approach seems fastest. In general, LINQ is not for improving the speed of code, but the understanding and maintainability, depending on how you feel about functional programming styles.

Your code is fairly fast, and beats the simple LINQ approaches using GroupBy. It gains a good advantage from using the fact that List.Sort is highly optimized, and my code uses that as well, but on a local copy of the list to avoid changing the source. My code is similar in approach to yours, but is designed around a single pass doing all the computation needed. It uses an extension method I re-optimized for this problem, called GroupByRuns, that returns an IEnumerable<IGrouping<T,T>>. It also is hand expanded rather than fall back on the generic GroupByRuns that takes extra arguments for key and result selection. Since .Net doesn't have an end user accessible IGrouping<,> implementation (!), I rolled my own that implements ICollection to optimize Count().

This code runs about 1.3x as fast as yours (after I slightly optimized yours by 5%).

First, the RunGrouping class to return a group of runs:

public class RunGrouping<T> : IGrouping<T, T>, ICollection<T> {
    public T Key { get; }
    int Count;

    int ICollection<T>.Count => Count;
    public bool IsReadOnly => true;

    public RunGrouping(T key, int count) {
        Key = key;
        Count = count;
    }

    public IEnumerator<T> GetEnumerator() {
        for (int j1 = 0; j1 < Count; ++j1)
            yield return Key;
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public void Add(T item) => throw new NotImplementedException();
    public void Clear() => throw new NotImplementedException();
    public bool Contains(T item) => Count > 0 && EqualityComparer<T>.Default.Equals(Key, item);
    public void CopyTo(T[] array, int arrayIndex) => throw new NotImplementedException();
    public bool Remove(T item) => throw new NotImplementedException();
}

Second, the extension method on IEnumerable that groups the runs:

public static class IEnumerableExt {
    public static IEnumerable<IGrouping<T, T>> GroupByRuns<T>(this IEnumerable<T> src) {
        var cmp = EqualityComparer<T>.Default;
        bool notAtEnd = true;
        using (var e = src.GetEnumerator()) {
            bool moveNext() {
                return notAtEnd;
            }
            IGrouping<T, T> NextRun() {
                var prev = e.Current;
                var ct = 0;
                while (notAtEnd && cmp.Equals(e.Current, prev)) {
                    ++ct;
                    notAtEnd = e.MoveNext();
                }
                return new RunGrouping<T>(prev, ct);
            }

            notAtEnd = e.MoveNext();
            while (notAtEnd)
                yield return NextRun();
        }
    }
}

Finally, the extension method that finds the max count modes. Basically it goes through the runs and keeps a record of those int with the current longest run count.

public static class IEnumerableIntExt {
    public static IEnumerable<KeyValuePair<int, int>> MostCommon(this IEnumerable<int> src) {
        var mysrc = new List<int>(src);
        mysrc.Sort();
        var maxc = 0;
        var maxmodes = new List<int>();
        foreach (var g in mysrc.GroupByRuns()) {
            var gc = g.Count();

            if (gc > maxc) {
                maxmodes.Clear();
                maxmodes.Add(g.Key);
                maxc = gc;
            }
            else if (gc == maxc)
                maxmodes.Add(g.Key);
        }

        return maxmodes.Select(m => new KeyValuePair<int, int>(m, maxc));
    }
}

Given an existing random list of integers rl, you can get the answer with:

var ans = rl.MostCommon();

I tried your code, it's wonderful! Linq is 3 times slower than your code. but I wonder if there is a way to avoid full-sort(I mean sorting full list) in Linq methods? — Dongdong, Aug 13 '18 at 14:06
@Dongdong There is, but it would be slower because `List.Sort` is optimized C++ code and faster than even one-pass LINQ code by itself. — NetMage, Aug 13 '18 at 17:30

score 0 · Answer 5 · answered Aug 14 '18 at 02:16

So far, Netmage's is the fastest I've found. The only thing I have been able to make that can beat it (at least with a valid range of 1 to 500,000,000) will only work with arrays ranging from 1 to 500,000,000 or smaller in value on my computer because I only have 8 GB of RAM. This prevents me from testing it with the full 1 to int.MaxValue range and I suspect it will fall behind in terms of speed at that size as it appears to struggle more and more with larger ranges. It uses the values as indexes and the value at those indexes as the occurrences. With 6 million randomly generated positive 16 bit integers, it is about 20 times as fast as my original method with both in Release Mode. It is only about 1.6 times as fast with 32 bit integers ranging from 1 to 500,000,000.

    private static List<NumberCount> findMostCommon(List<int> integers)
    {
        List<NumberCount> answers = new List<NumberCount>();

        int[] mostCommon = new int[_Max];

        int max = 0;
        for (int i = 1; i < integers.Count; i++)
        {
            int iValue = integers[i];
            mostCommon[iValue]++;
            int intVal = mostCommon[iValue];
            if (intVal > 1)
            {
                if (intVal > max)
                {
                    max++;
                    answers.Clear();
                    answers.Add(new NumberCount(iValue, max));
                }
                else if (intVal == max)
                {
                    answers.Add(new NumberCount(iValue, max));
                }
            }
        }

        if (answers.Count < 1)
            answers.Add(new NumberCount(0, -100)); // This -100 Occurrecnces value signifies that all values are equal.

        return answers;
    }

Perhaps a branching like this would be optiomal:

if (list.Count < sizeLimit) 
    answers = getFromSmallRangeMethod(list);
else 
    answers = getFromStandardMethod(list);

replace `int[] mostCommon = new int[_Max];` with `Dictionary mostCommon = new Dictionary();`, change `mostCommon[iValue]++;` to `if(!mostCommon.ContainsKey(iValue)) { mostCommon[iValue] = 1; } else { mostCommon[iValue]++; }` — Dongdong, Aug 14 '18 at 16:44

Most efficient way to find all modes in a List of randomly generated integers and how often they occured

5 Answers5