First of all, if you're serious about optimization, you have to measure the performance, because many "improvements" that look promising turn out to gain nothing or even make things worse. In many cases the compiler optimizes code better than a human would. So you have to run benchmarks; please see the following question on how to do that correctly:
How do I write a correct micro-benchmark in Java?
I'm posting two sketches of code below. They are really just sketches, to give a rough idea; I have neither tested nor benchmarked them.
One hint is that you access the map too much: you check it with map.get and then conditionally put a value with map.put. You can use putIfAbsent or computeIfAbsent instead. The way you increment the existing value can also be improved: I'd use a mutable AtomicInteger instead of an immutable Integer in this case. So I'd suggest the following:
Map<String, AtomicInteger> map = new HashMap<>();
Consumer<String> countWords = word -> map.computeIfAbsent(word, w -> new AtomicInteger(0)).incrementAndGet();

try (BufferedReader fileReader = new BufferedReader(new FileReader("resources/hugeFile"))) {
    String line;
    while ((line = fileReader.readLine()) != null) {
        splitAndConsumeWords(line, countWords);
    }
} catch (IOException e) {
    // FileNotFoundException is a subclass of IOException, so one catch suffices
    e.printStackTrace();
}
Next, you used line.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+") to convert the string to lowercase, retain only letters and spaces, and split it into words. I can't know for sure without a benchmark, but I suspect this is the most time-consuming operation in your code, and it is not a big deal to rewrite it without regular expressions. All you need is to iterate over the characters of the string, convert each to lowercase, and either append it to the current word or throw it away. Here's how I'd do it.
I'd create an array which maps every character to its replacement: the same character for a-z or space, the lowercase counterpart for A-Z. All other characters are mapped to 0, meaning they should be thrown away:
private static final char[] ONLY_LETTERS_TO_LOWERCASE = new char[Character.MAX_VALUE + 1]; // 65536 slots, one per char value
static {
    ONLY_LETTERS_TO_LOWERCASE[' '] = ' ';
    for (char c = 'a'; c <= 'z'; c++) {
        ONLY_LETTERS_TO_LOWERCASE[c] = c;
    }
    for (char c = 'A'; c <= 'Z'; c++) {
        ONLY_LETTERS_TO_LOWERCASE[c] = Character.toLowerCase(c);
    }
}
Then you simply look up a replacement for each character and build words:
public static void splitAndConsumeWords(String line, Consumer<String> wordsConsumer) {
    char[] characters = line.toCharArray();
    StringBuilder sb = new StringBuilder(16);
    for (int index = 0; index < characters.length; index++) {
        char ch = characters[index];
        char replacementCh = ONLY_LETTERS_TO_LOWERCASE[ch];
        // If we encounter a space
        if (replacementCh == ' ') {
            // And there is a word in the string builder
            if (sb.length() > 0) {
                // Send this word to the consumer
                wordsConsumer.accept(sb.toString());
                // Reset the string builder
                sb.setLength(0);
            }
        } else if (replacementCh != 0) {
            sb.append(replacementCh);
        }
    }
    // Send the last word to the consumer
    if (sb.length() > 0) {
        wordsConsumer.accept(sb.toString());
    }
}
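To check that the splitting behaves as intended, here's a self-contained sketch that runs the method above on a hypothetical input line and collects the words into a list (the 65536-slot table covers every possible char value):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SplitDemo {
    private static final char[] ONLY_LETTERS_TO_LOWERCASE = new char[Character.MAX_VALUE + 1];
    static {
        ONLY_LETTERS_TO_LOWERCASE[' '] = ' ';
        for (char c = 'a'; c <= 'z'; c++) ONLY_LETTERS_TO_LOWERCASE[c] = c;
        for (char c = 'A'; c <= 'Z'; c++) ONLY_LETTERS_TO_LOWERCASE[c] = Character.toLowerCase(c);
    }

    public static void splitAndConsumeWords(String line, Consumer<String> wordsConsumer) {
        StringBuilder sb = new StringBuilder(16);
        for (char ch : line.toCharArray()) {
            char replacementCh = ONLY_LETTERS_TO_LOWERCASE[ch];
            if (replacementCh == ' ') {
                // Word boundary: emit the accumulated word, if any
                if (sb.length() > 0) {
                    wordsConsumer.accept(sb.toString());
                    sb.setLength(0);
                }
            } else if (replacementCh != 0) {
                sb.append(replacementCh);
            }
        }
        // Emit the last word
        if (sb.length() > 0) {
            wordsConsumer.accept(sb.toString());
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        // Punctuation and digits are dropped, letters are lowercased
        splitAndConsumeWords("Hello, World! 42 times", words::add);
        System.out.println(words);
    }
}
```

Note how "42" disappears entirely: its characters map to 0 and the builder stays empty, so no empty word is emitted.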
An alternative to the ONLY_LETTERS_TO_LOWERCASE mapping table would be an if statement like:
if (ch >= 'a' && ch <= 'z' || ch == ' ') {
    replacementCh = ch;
} else if (ch >= 'A' && ch <= 'Z') {
    replacementCh = Character.toLowerCase(ch);
} else {
    replacementCh = 0;
}
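If you want to convince yourself that the two variants are interchangeable, a self-contained sketch can compare them over every possible char value (the helper method replaceWithIf is mine, just for the comparison):

```java
public class EquivalenceCheck {
    private static final char[] ONLY_LETTERS_TO_LOWERCASE = new char[Character.MAX_VALUE + 1];
    static {
        ONLY_LETTERS_TO_LOWERCASE[' '] = ' ';
        for (char c = 'a'; c <= 'z'; c++) ONLY_LETTERS_TO_LOWERCASE[c] = c;
        for (char c = 'A'; c <= 'Z'; c++) ONLY_LETTERS_TO_LOWERCASE[c] = Character.toLowerCase(c);
    }

    // The if-statement variant, wrapped in a method for the comparison
    static char replaceWithIf(char ch) {
        if (ch >= 'a' && ch <= 'z' || ch == ' ') {
            return ch;
        } else if (ch >= 'A' && ch <= 'Z') {
            return Character.toLowerCase(ch);
        } else {
            return 0;
        }
    }

    public static void main(String[] args) {
        // Exhaustively compare both variants over the whole char range
        for (int i = 0; i <= Character.MAX_VALUE; i++) {
            char ch = (char) i;
            if (ONLY_LETTERS_TO_LOWERCASE[ch] != replaceWithIf(ch)) {
                System.out.println("Mismatch at char value " + i);
                return;
            }
        }
        System.out.println("Both variants agree for all 65536 char values");
    }
}
```

Agreement on results says nothing about speed, of course; only a benchmark can tell which variant is faster.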
I don't know for sure which will work better; I'd expect the array lookup to be faster, but I'm not sure. This is why you'd ultimately need benchmarking.