0

I wrote the following function that parses a big text file (around 2 GB) into a Map line by line, effectively counting the occurrences of each word. I am interested only in words (lowercase to avoid entries repetition), no punctuation or whitespaces. Execution of the following code on the big file however takes almost 3 minutes. I am wondering why and whether there is a way to speed it up.

import java.util.*;

public class Stream {

Map<String, Integer> map = new HashMap();

public void getLines() {

    try (BufferedReader fileReader = new BufferedReader(new FileReader("resources/hugeFile"))) {
        String line ;
        while ((line = fileReader.readLine()) != null) {
            String[] words = line.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
            for (int i = 0; i < words.length; i++) {
                if (map.get(words[i]) == null) {
                    map.put(words[i], 1);
                }
                else {
                    int newValue = Integer.valueOf(String.valueOf(map.get(words[i])));
                    newValue++;
                    map.put(words[i], newValue);
                }
            }
        }

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Karl
  • 223
  • 1
  • 2
  • 11
  • 2
    Would be probably better suitable for https://codereview.stackexchange.com/. – lexicore Apr 05 '18 at 16:48
  • 4
    There are many things to improve, but why do you convert integer to string and back to integer here? `Integer.valueOf(String.valueOf(map.get(words[i])))` – lexicore Apr 05 '18 at 16:50
  • 1
    I'm sure you cannot speed up this code more than 15-20% by using fastutil (instead of native hashmap) for example. Now your bottleneck is hardware, not java. Double conversion from-to String is not bottleneck - that's 100% sure. – Oleg Gritsak Apr 05 '18 at 16:52
  • The if-else construct can be replaced with something like `map.computeIfAbsent(words[i], AtomicInteger::new).incrementAndGet())`. You have some unnecessary `get`s there. – lexicore Apr 05 '18 at 16:53
  • no need to call the replaceAll since you call toLowerCase() before it. – RAZ_Muh_Taz Apr 05 '18 at 16:55
  • 1
    I think the most problematic is `line.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+")`, consider rewriting it without regular expressions. – lexicore Apr 05 '18 at 16:55
  • @RAZ_Muh_Taz How so? `toLowerCase()` converts all characters to lower case, `replaceAll("[^a-z ]", "")` removes everything but `a-z` and space. How is `replaceAll` unnecessary? – lexicore Apr 05 '18 at 16:57
  • what do you gain? can't becomes cant? why is removing it helpful? it's is not the same as its so that creates an issue as well @lexicore – RAZ_Muh_Taz Apr 05 '18 at 16:59
  • 1
    @RAZ_Muh_Taz Removing `replaceAll` will give different results. Your claim "no need to call the replaceAll since you call toLowerCase() before it" is incorrect. – lexicore Apr 05 '18 at 17:05
  • @lexicore any idea what would be a computationally less taxing way of achieving the same result - removing all punctuation and numbers? – Karl Apr 05 '18 at 20:43
  • There's obviously something weird going on, code of other people who did basically the same thing ran twice as fast on very similar machines – Karl Apr 05 '18 at 20:46

1 Answers1

2

First of all, if you're serious about optimization, you have to measure the performance. Because many of the "improvements" which seem to be "improvements" may prove to bring either nothing or even worsen the performance. In many cases the compiler optimizes code better than human. So you have to do benchmarks, please see the following question on it:

How do I write a correct micro-benchmark in Java?

I'm posting two sketches of code below. These are really just sketches, to give a rough idea. I have neither tested them nor benchmarked.

One hint is that you access the map too much. You check it with map.get and then conditionally put value with map.put. You can use putIfAbsent or computeIfAbsent instead. Also the way you increment the existing value may be improved. I'd use a mutable AtomicInteger instead of immutable Integer in this case. So I'd suggest the following:

    Map<String, AtomicInteger> map = new HashMap<>();

    Consumer<String> countWords = word -> map.computeIfAbsent(word, (w) -> new AtomicInteger(0)).incrementAndGet();

    try (BufferedReader fileReader = new BufferedReader(new FileReader("resources/hugeFile"))) {
        String line;
        while ((line = fileReader.readLine()) != null) {
            splitAndConsumeWords(line, countWords);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

Next, you used line.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+") to convert the string to lowercase, retain only letters and spaces and split the string to words. I don't know it for sure without a benchmark, but I suspect that this is probably the most time-consuming operation in your code. And it is not a big deal to rewrite it without regular expressions. All you need is to iterate over the characters of string, convert them to lowercase, append to the current word or throw away. Here's how I'd do it.

I'd create an array which maps every character to its replacement. Same character for a-z or space, lowercase for A-Z. All other characters will be mapped to 0 meaning they should be thrown away:

private static char[] ONLY_LETTERS_TO_LOWERCASE = new char[65535];

static {
    ONLY_LETTERS_TO_LOWERCASE[' '] = ' ';
    for (char c = 'a'; c <= 'z'; c++) {
        ONLY_LETTERS_TO_LOWERCASE[c] = c;
    }
    for (char c = 'A'; c <= 'Z'; c++) {
        ONLY_LETTERS_TO_LOWERCASE[c] = Character.toLowerCase(c);
    }
}

Then you simply look up a replacement for each character and build words:

public static void splitAndConsumeWords(String line, Consumer<String> wordsConsumer) {

    char[] characters = line.toCharArray();
    StringBuilder sb = new StringBuilder(16);
    for (int index = 0; index < characters.length; index++) {
        char ch = characters[index];
        char replacementCh = ONLY_LETTERS_TO_LOWERCASE[ch];
        // If we encounter a space
        if (replacementCh == ' ') {
            // And there is a word in string builder
            if (sb.length() > 0) {
                // Send this word to the consumer
                wordsConsumer.accept(sb.toString());
                // Reset the string builder
                sb.setLength(0);
            }
        } else if (replacementCh != 0) {
            sb.append(replacementCh);
        }
    }
    // Send the last word to the consumer
    if (sb.length() > 0) {
        wordsConsumer.accept(sb.toString());
    }
}

An alternative to the ONLY_LETTERS_TO_LOWERCASE mapping table would be an if statement like:

        if (ch >= 'a' && ch <= 'z' || ch == ' ') {
            replacementCh = ch;
        } else if (ch >= 'A' && ch <= 'Z') {
            replacementCh = Character.toLowerCase(ch);
        }
        else {
            replacementCh = 0; 
        }

I don't know for sure what will work better, I think lookup in an array must be faster but I'm not sure. This is why you'd ultimately need benchmarking.

lexicore
  • 42,748
  • 17
  • 132
  • 221
  • Thank you for your beautiful answer. Problem is I later sort the map according to value and atomic integers don't support CompareTo method. I should try to do it with Integers I guess, but then I don't know how to write the one liner lambda you wrote for Map. – Karl Apr 06 '18 at 11:11
  • 1
    @Karl That's not an issue. You can sort using `Comparator.comparingInt(AtomicInteger::get)`. – lexicore Apr 06 '18 at 11:25
  • I turn my entries into an Array before that, like this: `Object[] a = map.entrySet().toArray(); Arrays.sort(a, new Comparator() { public int compare(Object o1, Object o2) { return ((Map.Entry) o2).getValue() .compareTo(((Map.Entry) o1).getValue()); } });` – Karl Apr 06 '18 at 11:45
  • 1
    @Karl Better use collections. You can also replace your hand-written-comparator with `Comparator.comparingInt(entry -> entry.getValue().get())`. – lexicore Apr 06 '18 at 14:10