This would be the fastest way:
final int[] counts = new int[1 << 16];    // one counter slot per possible char value
for (char c : yourString.toCharArray())   // yourString: whatever String you're counting
    counts[c]++;
(I've just sketched out the part that iterates over all your chars; I believe that's the easy part, and not directly related to this question.)
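For completeness, reading the tallies back out is just as direct. This small sketch is my addition; it assumes counts was filled as above and prints every character that actually occurred:

for (int c = 0; c < counts.length; c++)
    if (counts[c] != 0)
        System.out.printf("U+%04X '%c': %d%n", c, (char) c, counts[c]);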
Benchmark results
I've pitted the HashMap approach against mine with three string lengths (a sketch of the kind of JMH harness involved follows the list):
- 10
- 1,000
- 100,000
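Something along these lines would produce numbers of that shape. This is a reconstruction with current JMH annotations and names of my own choosing, not the exact harness I ran (the output format below comes from an older JMH version), the test-string contents are an assumption, and counts.merge is just one reasonable way to write the HashMap variant:

import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class CharCountBenchmark {

    @Param({"10", "1000", "100000"})
    int length;

    String s;

    @Setup
    public void buildString() {
        // The original test data isn't described; random lowercase letters are an assumption.
        Random r = new Random(42);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
            sb.append((char) ('a' + r.nextInt(26)));
        s = sb.toString();
    }

    @Benchmark
    public int[] array() {
        final int[] counts = new int[1 << 16];
        for (char c : s.toCharArray())
            counts[c]++;
        return counts;   // returned so JMH doesn't dead-code-eliminate the work
    }

    @Benchmark
    public Map<Character, Integer> hashMap() {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : s.toCharArray())
            counts.merge(c, 1, Integer::sum);
        return counts;
    }
}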
And these are the results:
Benchmark     Mode   Thr  Cnt  Sec      Mean  Mean error     Units
testArray1    thrpt    1    5    5     6.870       0.083  ops/msec
testArray2    thrpt    1    5    5     6.720       0.374  ops/msec
testArray3    thrpt    1    5    5     3.770       0.019  ops/msec
testHashMap1  thrpt    1    5    5  1269.123     251.766  ops/msec
testHashMap2  thrpt    1    5    5    12.776       0.165  ops/msec
testHashMap3  thrpt    1    5    5     0.141       0.005  ops/msec
What do they mean? Yes, zero-initializing a full 256 KB block of memory (65,536 four-byte ints) up front is costly. But once that price is paid, my array algorithm hardly even notices the thousands of characters whizzing by. The HashMap approach, on the other hand, is much faster for very short strings, but scales dramatically worse; I'd guess the crossover is at a string length of around 2,000 characters.
I suppose it is not disputed that such character-count statistics are usually run against massive text corpora, and not stuff like your name and surname.
Of course, the performance of the array approach can be improved substantially if you can assume that the input won't use the complete UTF-16 codepoint range. For example, with an array that accommodates only the lowest 1,024 codepoints, the performance rises to 470 ops/msec.
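A sketch of that trimmed-down variant, under the stated assumption that every character in the input is below codepoint 1,024 (the fallback remark in the comment is my suggestion, not something measured here):

final int[] counts = new int[1024];   // covers only char values below U+0400
for (char c : yourString.toCharArray()) {
    // Throws ArrayIndexOutOfBoundsException if the assumption is violated;
    // a real implementation might catch that and fall back to the full-size array.
    counts[c]++;
}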