What is the fastest way to gather symbol occurances in java

Question

My goal is to make a function which counts occurances of some symbols (chars) in line. An int ID is giving to every character I need to count. The set of chars is limited and I know it from the beginning. All the lines consist only of the chars from the giving set. The function processes gazzilions of lines. My profiler always shows the function which collects the stats is the slowest (97%) despite the program does a lot of other things. First I used a HashMap and code like this:

    occurances = new HashMap<>();
    for (int symbol : line) {
        Integer amount = 1;
        if (occurances.containsKey(symbol)) {
            amount += occurances.get(symbol);
        }
        occurances.put(symbol, amount);
    }

The profiler showed hashMap.put takes 97% processor usage

Then I tried to replace it with a created once ArrayList: And optimized it a litle bit (the lines are always longer than 1 char), but it's still very slow.

    int symbol = line[0];
    occurances.set(symbol, 1);

    for (int i = 1; i < length; i++) {
        symbol = line[i];
        occurances.set(symbol, 1 + occurances.get(symbol));
    }

Please if someone has some better ideas how to solve this task with better performance, your help would be very much appreceated.

The method put runs the hash method of the object that you use as the key. This is probably the reason for the "high" usage. You also need to understand that 97% does not necessarily mean that this line is a CPU hog. — Michael, Aug 07 '16 at 11:58

score 2 · Answer 1 · edited May 23 '17 at 11:44

2

As suggested here you can try to do somthing like

List<Integer> line = //get line as a list;
Map<Integer, Long> intCount = line.parallelStream()
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

edited May 23 '17 at 11:44

Community

1
1

answered Aug 07 '16 at 12:07

Vladimiro Corsi

108
4

I don't think this answers the question about performance. The question is primarily about performance and not how to count integers. It seems to me that this method is probably going to be the worst performing situation due to the immense amount of garbage it is going to create. If I'm wrong, let me know. – Johnny V Aug 07 '16 at 13:12
I have a little bit more difficult structure, I have simplified the code to make the problem more clear, I actually have my chars in 2-dimentional array and should check occurences only at certain positions in the array, so I can't use this method. But thank you anyway for an option! – Shpytyack Artem Aug 07 '16 at 18:37

score 1 · Answer 2 · edited Aug 07 '16 at 13:16

1

you can convert the char directly to an int and use it as an index

for (i=0; ; i++){
    occurences[(int)line[i]]++;
}

edited Aug 07 '16 at 13:16

fabian

80,457
12
86
114

answered Aug 07 '16 at 11:53

whyn0t

301
2
14

Thank you! The char was already converted to int and i used it as index, but I used AraryList get and set, which are slower than ++ of an array – Shpytyack Artem Aug 07 '16 at 18:35
glade you found it useful =] – whyn0t Aug 08 '16 at 07:35

score 1 · Answer 3 · answered Aug 07 '16 at 11:54

Its very possible that not parameterizing HashMap it is causing lots of performance problems.

What I would do is create a class called IntegerCounter. Look at AtomicInteger (http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/concurrent/atomic/AtomicInteger.java) code and copy everything from there except the code that makes it Atomic. Using IntegerCounter and incrementing the single instance of it should save you a lot of garbage collection.

Using new Integer(x) for the key lookup should allow for escape-analysis to automatically garbage collect it.

HashMap<Integer, IntegerCounter> occurances;

// since the set of characters are already known, add all of them here with an initial count of 0

for (int i = 0; i < length; i++) {
    occurances.get(new Integer(line[i])).incrementAndGet();
}

score 1 · Answer 4 · answered Aug 07 '16 at 13:12

In your code in most loop iterations you'll lookup the entry in the Map 3 times:

1.

occurances.containsKey(symbol)

2.

occurances.get(symbol);

3.

occurances.put(symbol, amount);

This is more than needed and you can simply use the fact that get returns null to improve this to 2 lookups:

Integer currentCount = occurances.get(symbol);
Integer amount = currentCount == null ? 1 : currentCount + 1;
occurances.put(symbol, amount);

Furthermore by using Integer, new Integer objects need to be created often (as soon as they exceed 127 or the upper bound that is used for the cached values), which decreases performance.

Also since you know the character set before analyzing the data, you could insert 0s (or equivalent) as values for all characters, which removes the need to check, if a mapping is already in the map.

The following code uses uses a helper class containing a int count field to store the data instead, which allows incrementing the value without boxing/unboxing conversions.

class Container {
    public int count = 0;
}

int[] symbolSet = ...
Map<Integer, Container> occurances = new HashMap<>();
for (int s : symbolSet) {
    occurances.put(s, new Container());
}

for (int symbol : line) {
    occurances.get(symbol).count++;
}

Also using a different data structure can also help. Things that come to mind are Perfect Hashing or storing the data in a data structure different from Map. However instead of using a ArrayList, I'd recommend using a int[] array, since this does not require any method calls and also removes the need for boxing/unboxing conversions to/from Integer. The data can still be converted to a more suitable data structure after calculating the frequencies.

score 1 · Accepted Answer · answered Aug 07 '16 at 16:23

You could try something like this:

public class CharCounter {

    final int max;
    final int[] counts;

    public CharCounter(char max) {
        this.max = (int) max;
        counts = new int[this.max + 1];
    }

    public void addCounts(char[] line) {
        for (int symbol : line) {
            counts[symbol]++;
        }
    }

    public Map<Integer, Integer> getCounts() {
        Map<Integer, Integer> countsMap = new HashMap<>();
        for (int symbol = 0; symbol < counts.length; symbol++) {
            int count = counts[symbol];
            if (count > 0) {
                countsMap.put(symbol, count);
            }
        }
        return countsMap;
    }
}

This uses an array to keep the counts and uses the char itself as an index to the array.
This eliminates the need to check if a map contains the given key etc. It also removes the need for autoboxing the chars.

And a performance comparison shows roughly 20x speedup:

public static final char MIN = 'a';
public static final char MAX = 'f';

private static void count1(Map<Integer, Integer> occurrences, char[] line) {
    for (int symbol : line) {
        Integer amount = 1;
        if (occurrences.containsKey(symbol)) {
            amount += occurrences.get(symbol);
        }
        occurrences.put(symbol, amount);
    }
}

private static void count2(CharCounter counter, char[] line) {
    counter.addCounts(line);
}

public static void main(String[] args) {
    char[] line = new char[1000];
    for (int i = 0; i < line.length; i++) {
        line[i] = (char) ThreadLocalRandom.current().nextInt(MIN, MAX + 1);
    }

    Map<Integer, Integer> occurrences;
    CharCounter counter;

    // warmup
    occurrences = new HashMap<>();
    counter = new CharCounter(MAX);
    System.out.println("Start warmup ...");
    for (int i = 0; i < 500_000; i++) {
        count1(occurrences, line);
        count2(counter, line);
    }
    System.out.println(occurrences);
    System.out.println(counter.getCounts());
    System.out.println("Warmup done.");


    // original method
    occurrences = new HashMap<>();
    System.out.println("Start timing of original method ...");
    long start = System.nanoTime();
    for (int i = 0; i < 500_000; i++) {
        count1(occurrences, line);
    }
    System.out.println(occurrences);
    long duration1 = System.nanoTime() - start;
    System.out.println("End timing of original method.");
    System.out.println("time: " + duration1);


    // alternative method
    counter = new CharCounter(MAX);
    System.out.println("Start timing of alternative method ...");
    start = System.nanoTime();
    for (int i = 0; i < 500_000; i++) {
        count2(counter, line);
    }
    System.out.println(counter.getCounts());
    long duration2 = System.nanoTime() - start;
    System.out.println("End timing of alternative method.");
    System.out.println("time: " + duration2);

    System.out.println("Speedup: " + (double) duration1 / duration2);
}

Output:

Start warmup ...
{97=77000000, 98=82000000, 99=86500000, 100=86000000, 101=80000000, 102=88500000}
{97=77000000, 98=82000000, 99=86500000, 100=86000000, 101=80000000, 102=88500000}
Warmup done.
Start timing of original method ...
{97=77000000, 98=82000000, 99=86500000, 100=86000000, 101=80000000, 102=88500000}
End timing of original method.
time: 7110894999
Start timing of alternative method ...
{97=77000000, 98=82000000, 99=86500000, 100=86000000, 101=80000000, 102=88500000}
End timing of alternative method.
time: 388308432
Speedup: 18.31249185698857

Also if you add the -verbose:gc JVM flag you can see that the original method needs to do quite a bit of garbage collecting while the alternative method doesn't need any.

Changed ArrayList to array and replaced with ++ gave ~20% better performace. Now this method is 74% instead of 97%! Thank you very much! — Shpytyack Artem, Aug 07 '16 at 18:34

What is the fastest way to gather symbol occurances in java

5 Answers5