I would loop over the collection inserting into a Map datastructure with the following logic:
- If the integer has not yet been inserted into the map, then insert key=integer, value=1.
- If the key exists, increment the value.
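The loop above can be sketched as follows. Map.merge covers both cases (insert 1 on first sight, otherwise increment); the class and method names here are mine:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencyCounter {
    // Count occurrences of each integer. merge() inserts value 1 if the
    // key is absent, otherwise combines the old value with 1 via Integer::sum.
    static Map<Integer, Integer> count(List<Integer> input) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int n : input) {
            counts.merge(n, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of(1, 2, 2, 3, 3, 3)));
    }
}
```

Swapping `new HashMap<>()` for `new TreeMap<>()` is the only change needed to try the other structure, which is why the comparison below is worth making.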
There are two Maps in Java you could use - HashMap and TreeMap - these are compared below:
HashMap vs. TreeMap
You can skip the detailed explanation and jump straight to the summary if you wish.
A HashMap is a Map which stores key-value pairs in an array. The index used for key k is:
- k.hashCode() % capacity, where capacity is the length of the backing array (the real implementation uses a power-of-two bitmask, but modulo captures the idea)
Sometimes two completely different keys will end up at the same index. To solve this, each location in the array is really a linked list, which means every lookup always has to loop over the linked list and check for equality using the k.equals(other) method. Worst case, all keys get stored at the same location and the HashMap becomes an unindexed list.
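Collisions are easy to demonstrate: "FB" and "Ea" are a well-known pair of Java strings with identical hash codes, so a HashMap must fall back on equals() to tell them apart:

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    public static void main(String[] args) {
        // Both strings hash to 2236, so they land in the same bucket.
        System.out.println("FB".hashCode()); // 2236
        System.out.println("Ea".hashCode()); // 2236

        // The map still keeps them apart, by calling equals() within the bucket.
        Map<String, Integer> map = new HashMap<>();
        map.put("FB", 1);
        map.put("Ea", 2);
        System.out.println(map.get("FB") + " " + map.get("Ea")); // 1 2
    }
}
```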
As the HashMap gains more entries, the likelihood of these clashes increases, and the efficiency of the structure decreases. To solve this, when the number of entries reaches a critical point (determined by the loadFactor argument in the constructor), the structure is resized:
- A new array is allocated at about twice the current size
- A loop is run over all the existing keys
- The key's location is recomputed for the new array
- The key-value pair is inserted into the new structure
As you can see, this can become relatively expensive if there are many resizes.
This problem can be overcome if you can pre-allocate the HashMap at an appropriate size before you begin, eg map = new HashMap<>((int) (input.size() * 1.5)) - note the constructor takes an int, so the arithmetic must be cast. For large datasets, this can dramatically reduce memory churn.
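A small helper makes the sizing rule explicit. With the default load factor of 0.75, the initial capacity must be at least expected / 0.75 for all entries to fit without a resize (the helper name is mine):

```java
import java.util.HashMap;
import java.util.Map;

public class PreSizedMap {
    // Pre-allocate so that expectedEntries fits without triggering a resize.
    // HashMap resizes once size > capacity * loadFactor, so with the default
    // load factor of 0.75 we need capacity >= expectedEntries / 0.75.
    static <K, V> Map<K, V> withCapacityFor(int expectedEntries) {
        int capacity = (int) Math.ceil(expectedEntries / 0.75);
        return new HashMap<>(capacity);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = withCapacityFor(1_000_000);
        counts.put(42, 1);
        System.out.println(counts);
    }
}
```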
Because the keys are essentially randomly positioned in the HashMap, the key iterator will visit them in an arbitrary order. Java does provide the LinkedHashMap, which will iterate in the order the keys were inserted.
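The difference is easy to see side by side. LinkedHashMap guarantees insertion order when iterating; HashMap guarantees no order at all:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IterationOrder {
    public static void main(String[] args) {
        // LinkedHashMap remembers the order keys were inserted.
        Map<String, Integer> linked = new LinkedHashMap<>();
        linked.put("first", 1);
        linked.put("second", 2);
        linked.put("third", 3);
        System.out.println(linked.keySet()); // [first, second, third]
        // A plain HashMap with the same inserts may print these in any order.
    }
}
```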
Performance for a HashMap:
- Given the correct size and good distribution of hashes, lookup is constant-time.
- With bad distribution, performance drops to (in the worst case) linear search - O(n).
- With bad initial sizing, performance includes the cost of repeated rehashing. The amortised cost per insert stays low, but individual inserts that trigger a resize stall for the time it takes to rebuild the whole table.
OTOH a TreeMap stores entries in a balanced tree (a red-black tree) - a dynamic structure that is incrementally built up as key-value pairs are added. Insert is dependent on the depth of the tree (O(log n)), but is predictable - unlike HashMap, there are no stalls, and no edge conditions where performance drops.
Each insert and lookup is more expensive than in a well-distributed HashMap, however.
Further, in order to insert the key in the tree, every key must be comparable to every other key, requiring the k.compareTo(other) method from the Comparable interface. Obviously, given the question is about integers, this is not a problem.
Performance for a TreeMap:
- Insert of n elements is O(n log n)
- Lookup is O(log n)
Summary
First thoughts: Dataset size:
- If small (even in the 1000's and 10,000's) it really doesn't matter on any modern hardware
- If large, to the point of causing the machine to run out of memory, then TreeMap may be the only option
- Otherwise, size is probably not the determining factor
In this specific case, a key factor is whether the expected number of unique integers is large or small compared to the overall dataset size.
- If small, then the overall time will be dominated by key lookup in a small set, so optimization is irrelevant (you can stop here).
- If large, then the overall time will be dominated by insert, and the decision rests on more factors:
- Dataset is of known size?
- If yes: The HashMap can be pre-allocated, and so memory churn eliminated. This is especially important if the hashCode() method is expensive (not in our case)
- If no: A TreeMap provides more predictable performance and may be the better choice
- Is predictable performance with no large stalls required, eg in real-time systems or on the event thread of a GUI?
- If yes: A TreeMap provides much better predictability with no stalls
- If no: A HashMap probably provides better overall performance for the whole computation
One final consideration, if none of the points above is decisive:
- Is a sorted list of keys of value?
- If yes (eg to print a histogram): A TreeMap has already sorted the keys, and so is convenient
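For example, printing a histogram falls straight out of a TreeMap, because iterating its entries visits the keys in ascending order (the class and method names here are mine):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Histogram {
    // Build the counts in a TreeMap; entrySet() iteration is then in
    // ascending key order, which is exactly what a histogram wants.
    static Map<Integer, Integer> count(List<Integer> input) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int n : input) counts.merge(n, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, Integer> e : count(List.of(3, 1, 2, 3, 1, 3)).entrySet()) {
            System.out.println(e.getKey() + " : " + "*".repeat(e.getValue()));
        }
    }
}
```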
However, if performance is important, the only way to decide is to code against the Map interface, then profile both the HashMap and the TreeMap to see which is actually faster in your situation. Premature optimization is the root of much evil :)
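A rough sketch of that comparison is below. Because the loop is written against the Map interface, the implementation is just a constructor reference passed in. The numbers from a one-shot timing like this are only indicative - for trustworthy results use a proper harness such as JMH:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
import java.util.function.Supplier;

public class MapBench {
    // Time the counting loop against whichever Map the factory supplies.
    // Coding to the interface means HashMap::new and TreeMap::new are
    // interchangeable here.
    static long timeCount(Supplier<Map<Integer, Integer>> factory, int[] data) {
        long start = System.nanoTime();
        Map<Integer, Integer> counts = factory.get();
        for (int n : data) counts.merge(n, 1, Integer::sum);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000, 0, 10_000).toArray();
        System.out.println("HashMap: " + timeCount(HashMap::new, data) + " ns");
        System.out.println("TreeMap: " + timeCount(TreeMap::new, data) + " ns");
    }
}
```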