2

In Java, I need an algorithm to find the maximum number of occurrences in a collection of integers. For example, if my set is [2,4,3,2,2,1,4,2,2], the algorithm needs to output 5, because 2 is the most frequently occurring integer and it appears 5 times. Think of it as finding the peak of the histogram of the set of integers.

The challenge is that I have to do this for many sets of many integers, one set at a time, so it needs to be efficient. Also, I do not know in advance which element will appear most often in a set; it is totally random.

I thought about putting the values of the set into an array, sorting it, and then iterating over the array, counting consecutive appearances of each number and keeping the maximum count, but I am guessing that will take a long time. Are there any libraries or algorithms that could help me do this efficiently?

j0k
Erol
    To avoid confusion, you may want to change the word "set" to "collection", because the term "set" has a specific meaning in Java, and no duplicates are allowed with Java sets. Consider using a `HashMap` to hold the number (the key) and its frequency (the value). – Hovercraft Full Of Eels May 21 '12 at 01:26
  • Are you the one who actually inserts the values into the set of integers? or are they coming from some external source. – Hunter McMillen May 21 '12 at 01:27
  • I indicated it explicitly in parentheses but I will do it just in case. – Erol May 21 '12 at 01:27
  • possible duplicate of [Finding the mode of an array](http://stackoverflow.com/questions/8936067/finding-the-mode-of-an-array) – QuantumMechanic May 21 '12 at 01:28
  • They are calculated by an external source so they are totally random. – Erol May 21 '12 at 01:28
  • My initial thought was that you could put the values in a map with the integer value being the key and the count being the value, but that would require N insertions into the map then you would have to iterate over the map (N again) to see which one has the highest count. I think that sorting would be a good way to go. – Hunter McMillen May 21 '12 at 01:30
  • @QuantumMechanic, if you read my question carefully, I am not aiming for the mode, I am aiming for how many times the mode is appearing. – Erol May 21 '12 at 01:33
  • @babatenor That _is_ the mode. – Hunter McMillen May 21 '12 at 01:36
  • @HunterMcMillen: it's slightly different, but one hopes knowing the mode makes finding the count easier... – sarnold May 21 '12 at 01:36
  • @baba then you should look at @QuantumMechanic's link. Try the [accepted answer's suggestion](http://stackoverflow.com/a/8936098/1210260) about `HashMap` – gobernador May 21 '12 at 01:46

7 Answers

3

What's wrong with sorting? That's O(n log n), which isn't bad at all. Any better solution would either require more information about the input sets (an upper bound on the numbers involved, perhaps) or involve a Map<Integer, Integer> or something equivalent.
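The sort-then-scan approach the question describes is simple to get right. A minimal sketch (assuming the input is a plain int[]):

```java
import java.util.Arrays;

public class MaxOccurrences {
    // Sort, then count runs of equal values in one pass: O(n log n) overall.
    static int maxOccurrences(int[] values) {
        if (values.length == 0) return 0;
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int best = 1, run = 1;
        for (int i = 1; i < sorted.length; i++) {
            run = (sorted[i] == sorted[i - 1]) ? run + 1 : 1; // extend or restart the run
            best = Math.max(best, run);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(maxOccurrences(new int[] {2, 4, 3, 2, 2, 1, 4, 2, 2})); // prints 5
    }
}
```

The scan only compares adjacent elements, so after sorting it adds just O(n).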

Louis Wasserman
3

I would loop over the collection inserting into a Map datastructure with the following logic:

  • If the integer has not yet been inserted into the map, then insert key=integer, value=1.
  • If the key exists, increment the value.
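That loop might look like this (a sketch using plain get/put, which matches the logic above):

```java
import java.util.HashMap;
import java.util.Map;

public class FrequencyCounter {
    // Build a value -> count map, then take the maximum count.
    static int maxFrequency(int[] values) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int v : values) {
            Integer current = counts.get(v);
            // insert 1 for a new key, otherwise increment the existing count
            counts.put(v, current == null ? 1 : current + 1);
        }
        int max = 0;
        for (int count : counts.values()) {
            max = Math.max(max, count);
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(maxFrequency(new int[] {2, 4, 3, 2, 2, 1, 4, 2, 2})); // prints 5
    }
}
```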

There are two Maps in Java you could use - HashMap and TreeMap - these are compared below:

HashMap vs. TreeMap

You can skip the detailed explanation and jump straight to the summary if you wish.

A HashMap is a Map which stores key-value pairs in an array. The index used for key k is (roughly):

  • k.hashCode() % table.length

where table.length is the capacity of the internal array, not the number of entries.

Sometimes two completely different keys will end up at the same index. To solve this, each location in the array is really a linked list, which means every lookup always has to loop over the linked list and check for equality using the k.equals(other) method. Worst case, all keys get stored at the same location and the HashMap becomes an unindexed list.

As the HashMap gains more entries, the likelihood of these clashes increases, and the efficiency of the structure decreases. To solve this, when the number of entries reaches a critical point (determined by the loadFactor argument in the constructor), the structure is resized:

  • A new array is allocated at about twice the current size
  • A loop is run over all the existing keys
    • The key's location is recomputed for the new array
    • The key-value pair is inserted into the new structure

As you can see, this can become relatively expensive if there are many resizes.

This problem can be overcome if you can pre-allocate the HashMap at an appropriate size before you begin, eg map = new HashMap<Integer, Integer>((int) (input.size() / 0.75f) + 1) - the capacity must be an int, and sizing it above size/loadFactor avoids any resize. For large datasets, this can dramatically reduce memory churn.

Because the keys are essentially randomly positioned in the HashMap, the key iterator will iterate over them in a random order. Java does provide the LinkedHashMap, which iterates in the order the keys were inserted.

Performance for a HashMap:

  • Given the correct size and good distribution of hashes, lookup is constant-time.
  • With bad distribution, performance drops to (in the worst case) linear search - O(n).
  • With bad initial sizing, performance becomes that of rehashing. I can't trivially calculate this, but it's not good.

OTOH a TreeMap stores entries in a balanced tree - a dynamic structure that is incrementally built up as key-value pairs are added. Insert is dependent on the depth of the tree (O(log(tree.size()))), but is predictable - unlike HashMap, there are no hiatuses, and no edge conditions where performance drops.

Each insert and lookup is more expensive than in a well-distributed HashMap, however.

Further, in order to insert the key in the tree, every key must be comparable to every other key, requiring the k.compareTo(other) method from the Comparable interface. Obviously, given the question is about integers, this is not a problem.

Performance for a TreeMap:

  • Insert of n elements is O(n log n)
  • Lookup is O(log n)

Summary

First thoughts: Dataset size:

  • If small (even in the 1,000s and 10,000s), it really doesn't matter on any modern hardware
  • If large, to the point of causing the machine to run out of memory, then TreeMap may be the only option
  • Otherwise, size is probably not the determining factor

In this specific case, a key factor is whether the expected number of unique integers is large or small compared to the overall dataset size.

  • If small, then the overall time will be dominated by key lookup in a small set, so optimization is irrelevant (you can stop here).
  • If large, then the overall time will be dominated by insert, and the decision rests on more factors:
    • Dataset is of known size?
      • If yes: The HashMap can be pre-allocated, and so memory churn eliminated. This is especially important if the hashCode() method is expensive (not in our case)
      • If no: A TreeMap provides more predictable performance and may be the better choice
    • Is predictable performance with no large stalls required, eg in real-time systems or on the event thread of a GUI?
      • If yes: A TreeMap provides much better predictability with no stalls
      • If no: A HashMap probably provides better overall performance for the whole computation

One final point if there is not an overwhelming point from above:

  • Is a sorted list of keys of value?
    • If yes (eg to print a histogram): A TreeMap has already sorted the keys, and so is convenient

However, if performance is important, the only way to decide would be to implement against the Map interface, then profile both the HashMap and the TreeMap to see which is actually faster in your situation. Premature optimization is the root of much evil :)
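Since both classes implement Map, the counting code can be written once against the interface and then profiled with either implementation. A minimal sketch (the countInto helper name is made up here):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapComparison {
    // The counting logic depends only on the Map interface,
    // so the concrete implementation can be swapped for profiling.
    static Map<Integer, Integer> countInto(Map<Integer, Integer> counts, int[] values) {
        for (int v : values) {
            Integer current = counts.get(v);
            counts.put(v, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] data = {2, 4, 3, 2, 2, 1, 4, 2, 2};
        System.out.println(countInto(new HashMap<>(), data)); // iteration order unspecified
        System.out.println(countInto(new TreeMap<>(), data)); // keys sorted: {1=1, 2=5, 3=1, 4=2}
    }
}
```

As a bonus, the TreeMap version gives you the sorted keys for free, eg for printing a histogram.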

Andrew Alcock
  • Thank you for the answer. What is the advantage of using TreeMap instead of HashMap in this example? – Erol May 21 '12 at 16:00
  • I have made significant edits to my post to address your question. In summary of the summary: if the dataset is small, performance is probably too close to worry about, and the TreeMap gives you the numbers in sorted order. For large datasets (millions, billions, and larger) HashMap may be better if you have memory for it, but you should profile the performance. – Andrew Alcock May 22 '12 at 01:55
  • What a great answer. Thanks a lot for your time and effort. You did not only clearly enlighten me about the solution but also answered lots of questions and ambiguities I had about HashMap vs. TreeMap. For my case, number of integers are small so I will give HashMap a try. Good job. – Erol May 22 '12 at 05:01
2
  1. The basic method is to sort the collection and then simply run through the sorted collection. (This would be done in O(n log n + n), which is O(n log n).)

  2. If the numbers are bounded (say for example, -10000,10000) and the collection contains a lot of integers you can use a lookup table and count each element. This would take O(n + l) (O(n) for the count, O(l) to find the max element) where l is the range length (20001 in this case). As you can see, if n >> l then this would become O(n) which is better than 1, but if n << l then it's O(l) which is constant but big enough to make this unusable.

  3. Another variant of the previous is to use a hash table instead of a lookup table. This would improve the complexity to O(n) expected, but it is not guaranteed to be faster than 2 when n >> l. The good news is that the values don't have to be bounded.
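Method 2 might be sketched in Java as follows (the [-10000, 10000] bounds are the assumed example range from above):

```java
public class BoundedCounter {
    static final int MIN = -10000, MAX = 10000; // assumed bounds on the values

    // Count each value into a lookup table, then scan for the peak: O(n + l).
    static int maxFrequency(int[] values) {
        int[] table = new int[MAX - MIN + 1];
        for (int v : values) {
            table[v - MIN]++;           // O(n) counting pass
        }
        int max = 0;
        for (int count : table) {
            max = Math.max(max, count); // O(l) scan of the whole range
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(maxFrequency(new int[] {2, 4, 3, 2, 2, 1, 4, 2, 2})); // prints 5
    }
}
```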

I'm not much of a Java programmer, but if you need help coding these, let me know.

Samy Arous
1

Here is a sample implementation of your program. It returns the number with the highest frequency, and if two numbers are tied for the most occurrences, the larger number is returned. If you want the frequency instead, change the last line of the code to "return mf;".

public int mode(int[] a, int n) {
    int mf = 0;        // highest frequency seen so far
    int mv = a[0];     // value with that frequency
    for (int i = 0; i < n; i++) {
        int f = 0;
        for (int j = 0; j < n; j++) { // count occurrences of a[i]: O(n^2) overall
            if (a[i] == a[j]) {
                f++;
            }
        }
        if (f > mf || (f == mf && a[i] > mv)) {
            mf = f;
            mv = a[i];
        }
    }
    return mv; // return mf instead for the frequency
}

Jeris
1

This little puppy works (edited to return the frequency instead of the number):

public static int mostFrequent(int[] numbers) {
    Map<Integer, AtomicInteger> map = new HashMap<Integer, AtomicInteger>() {
        public AtomicInteger get(Object key) {
            AtomicInteger value = super.get(key);
            if (value == null) {
                value = new AtomicInteger();
                super.put((Integer) key, value);
            }
            return value;
        }

    };

    for (int number : numbers)
        map.get(number).incrementAndGet();

    List<Entry<Integer, AtomicInteger>> entries = new ArrayList<Map.Entry<Integer, AtomicInteger>>(map.entrySet());
    Collections.sort(entries, new Comparator<Entry<Integer, AtomicInteger>>() {
        @Override
        public int compare(Entry<Integer, AtomicInteger> o1, Entry<Integer, AtomicInteger> o2) {
            return o2.getValue().get() - o1.getValue().get();
        }
    });

    return entries.get(0).getValue().get(); // return the largest *frequency*

    // Use this next line instead to return the most frequent *number*
    // return entries.get(0).getKey(); 
}

AtomicInteger was chosen to avoid creating new objects with every increment, and the code reads a little cleaner.

The anonymous map class was used to centralize the "if null" code.

Here's a test:

public static void main(String[] args) {
    System.out.println(mostFrequent(new int[] { 2, 4, 3, 2, 2, 1, 4, 2, 2 }));
}

Output:

5
Bohemian
1

Since it's a collection of integers, one can use either

  1. radix sort to sort the collection, which takes O(nb) where b is the number of bits used to represent the integers (32 or 64, if you use Java's primitive integer data types), or
  2. a comparison-based sort (quicksort, merge sort, etc) and that takes O(n log n).

Notes:

  • The larger your n becomes, the more likely that radix sort will be faster than comparison-based sorts. For smaller n, you are probably better off with a comparison-based sort.
  • If you know a bound on the values in the collection, b will be even smaller than 32 (or 64) making the radix sort more desirable.
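For reference, a sketch of LSD radix sort on 8-bit digits - four counting passes for 32-bit ints, with the sign bit flipped on the top byte so negative values order correctly:

```java
import java.util.Arrays;

public class RadixSortDemo {
    // LSD radix sort on 8-bit digits: 4 stable counting passes, O(n * b/8).
    static void radixSort(int[] a) {
        int[] aux = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int v : a) {
                int digit = (v >>> shift) & 0xFF;
                if (shift == 24) digit ^= 0x80; // sign-bit flip on the final pass
                count[digit + 1]++;
            }
            for (int i = 0; i < 256; i++) {
                count[i + 1] += count[i];       // prefix sums give start positions
            }
            for (int v : a) {
                int digit = (v >>> shift) & 0xFF;
                if (shift == 24) digit ^= 0x80;
                aux[count[digit]++] = v;        // stable placement
            }
            System.arraycopy(aux, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        int[] data = {2, 4, 3, -2, 2, 1, 4, 2, 2};
        radixSort(data);
        System.out.println(Arrays.toString(data)); // [-2, 1, 2, 2, 2, 2, 3, 4, 4]
    }
}
```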
Salem Derisavi
0

Using a HashMap:

import java.util.HashMap;

public class NumberCounter {

    static int[] arr = {1, 2, 1, 23, 4, 5, 4, 1, 2, 3, 12, 23};

    public static void main(String[] args) {
        HashMap<Integer, Integer> map = new HashMap<Integer, Integer>();

        // Count each value
        for (int i = 0; i < arr.length; i++) {
            Integer count = map.get(arr[i]);
            if (count != null) {
                map.put(arr[i], count + 1);
            } else {
                map.put(arr[i], 1);
            }
        }

        // Find the maximum count - iterate the map's values (not the array)
        // so every entry is examined exactly once
        int max = 0;
        for (Integer count : map.values()) {
            if (count > max) {
                max = count;
            }
        }
        System.out.print(max);
    }
}
saba