High Level Java Optimization

Question

There are many questions, answers, and opinions about low-level Java optimization of for, while, and do-while loops, and about whether it's even necessary.

My question is more about high-level optimization in design. Let's assume I have to do the following:

For a given string input, count the occurrences of each letter in the string.

This is not a major problem when the string is a few sentences, but what if we instead want to count the occurrences of each word in a 900,000-word file? Building loops just wastes time.

So what high-level design pattern can be applied to this type of problem?

I guess my major point is that I tend to use loops to solve many problems, and I would like to get out of the habit of using loops.

thanks in advance

Sam

P.S. If possible, can you produce some pseudocode for solving the 900,000-word file problem? I tend to understand code better than English, which I assume is the same for most visitors to this site.

The only possible solution for this without a loop is recursion but given Java does not implement any optimizations for recursion and you would have a stack overflow error, the only solution is a loop. Not sure why you think loops are wasting time. — Maurício Linhares, Aug 13 '11 at 04:29
There is nothing wrong about loops when it comes to optimization. Problems come when you design them badly or when you are concerned about maintainability and readability of the code. — Gabriel Ščerbák, Aug 13 '11 at 04:31
By getting away from loops the OP means using operations in which loops are built in -- like map, filter, and reduce -- or by writing competent Unix pipelines with awk, cut, perl -le, sort, uniq, etc. — Ray Toal, Aug 13 '11 at 04:33
@Gabriel I just think that doing a 100 loop operation is safe and fast, but when you consider a 1,000,000 loop, things get a little different. Sam — Sam Mohamed, Aug 13 '11 at 05:11
@Sam it doesn't matter how many times a loop body is performed as long as it does something useful, so 1 000, 1 000 000 or even potentially infinite are all ok. By "badly designed" loop I meant, for example, if you went through your file twice - once to tokenize it and a second time to count the words - the result is correct, but you unnecessarily went through the file twice instead of in a single pass as suggested in the answers. — Gabriel Ščerbák, Aug 13 '11 at 07:06

Ray Toal · Accepted Answer · 2011-08-13T15:48:27.553

The word count problem is one of the most widely covered problems in the Big Data world; it's kind of the Hello World of frameworks like Hadoop. You can find ample information throughout the web on this problem.

I'll give you some thoughts on it anyway.

First, 900000 words might still be small enough to build a hashmap for, so don't discount the obvious in-memory approach. You said pseudocode is fine, so:

h = new HashMap<String, Integer>();
for each word w picked up while tokenizing the file {
  h[w] = w in h ? h[w] + 1 : 1
}
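
In real Java the pseudocode above might look like the following minimal sketch. The class and method names are made up for illustration, and it assumes Java 8 or later for `Map.merge`, which postdates this thread. Words are split on whitespace only, so punctuation stays attached.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class InMemoryWordCount {
    /** Counts whitespace-separated, lower-cased words from any character source in one pass. */
    public static Map<String, Integer> countWords(Readable source) {
        Map<String, Integer> counts = new HashMap<>();
        Scanner input = new Scanner(source);
        while (input.hasNext()) {
            // merge() stores 1 for a new word, otherwise adds 1 to the stored count
            counts.merge(input.next().toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the file here; a FileReader would work the same way.
        Map<String, Integer> counts =
                countWords(new StringReader("Hey Moe! Woo\nwoo woo nyuk-nyuk"));
        System.out.println(counts.get("woo"));
    }
}
```

The loop is still there, but it is a single pass over the input, which is the part that matters.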

Now once your dataset is too large to build an in-memory hashmap, you can do your counting like so:

1. Tokenize into words, writing each word to a single line in a file
2. Use the Unix sort command to produce the next file
3. Count as you traverse the sorted file

These three steps go in a Unix pipeline. Let the OS do the work for you here.
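
If the counting step has to be done in Java rather than by a Unix tool, it could be sketched roughly as below. Because the input is already sorted, equal words are adjacent, so only the current word and its running count need to be held in memory. The class name `SortedRunCounter` and the "count word" output format are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.util.Iterator;

public class SortedRunCounter {
    /** Reads already-sorted lines and prints "count word" for each run of equal lines. */
    public static void countRuns(Iterator<String> sortedLines, PrintStream out) {
        String previous = null;
        long run = 0;
        while (sortedLines.hasNext()) {
            String line = sortedLines.next();
            if (line.equals(previous)) {
                run++;                      // same word as before: extend the run
            } else {
                if (previous != null) {
                    out.println(run + " " + previous);
                }
                previous = line;            // a new word starts a new run
                run = 1;
            }
        }
        if (previous != null) {
            out.println(run + " " + previous); // emit the final run
        }
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        countRuns(in.lines().iterator(), System.out);
    }
}
```

It would sit at the end of the pipeline, after `sort`, reading standard input and writing standard output.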

Now, as you get even more data, you want to bring in map-reduce frameworks like Hadoop to do the word counting on clusters of machines.

Now, I've heard that when you get into obscenely large datasets, doing things in a distributed environment does not help anymore, because the transmission time overwhelms the counting time. In your case of word counting, everything has to "be put back together anyway", so you then have to use some very sophisticated techniques that I suspect you can find in research papers.

ADDENDUM

The OP asked for an example of tokenizing the input in Java. Here is the easiest way:

import java.util.Scanner;
public class WordGenerator {
    /**
     * Tokenizes standard input into words, writing each word to standard output,
     * one per line.  Because it reads from standard input and writes to standard
     * output, it can easily be used in a pipeline combined with sort, uniq, and
     * any other such application.
     */
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next().toLowerCase());
        }
    } 
}

Now here is an example of using it:

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator

This outputs

hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.

You can combine this tokenizer with sort and uniq like so:

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq

Yielding

hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo

Now if you only want to keep letters and throw away all punctuation, digits and other characters, change your scanner definition line to:

Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}"));

And now

echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq

Yields

hey
moe
nyuk
soitenly
why
woo

There is a blank line in the output; I'll let you figure out how to whack it. :)
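
One possible way to whack it, as a sketch rather than the only answer: the blank line comes from two delimiters in a row (such as "! "), which leave an empty token between them. Treating a run of non-letters as a single delimiter (`\\P{L}+`) avoids that. The class name `LetterTokenizer` is hypothetical, and it tokenizes a String only to keep the example self-contained; the same delimiter should work on `new Scanner(System.in)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class LetterTokenizer {
    // One or more non-letters form a single delimiter, so adjacent
    // punctuation and whitespace no longer produce empty tokens.
    private static final Pattern NON_LETTERS = Pattern.compile("\\P{L}+");

    public static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        Scanner input = new Scanner(text).useDelimiter(NON_LETTERS);
        while (input.hasNext()) {
            String word = input.next().toLowerCase();
            if (!word.isEmpty()) {          // defensive: drop any empty token that slips through
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        for (String word : tokenize("Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey.")) {
            System.out.println(word);
        }
    }
}
```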

@Ray Great answer. So I guess a pure Java solution for a large file isn't the answer. Although your solution is very platform dependent, mainly Unix. The idea is to have a cross platform design, right? I am concerned about this for an interview code test. The idea is to take a poorly written file read program and improve it. I remember a year ago I had an exam from Amazon, which I didn't pass, and one of the questions was the word count problem. — Sam Mohamed, Aug 13 '11 at 05:19
@Ray And for the in-memory approach, can you provide some Java code for file tokenization? — Sam Mohamed, Aug 13 '11 at 05:21
@Ray you can read a file using a buffer and store a HashMap into a file in Java easily as well; I would say the *nix solution has nothing to do at all with this question about Java. — Gabriel Ščerbák, Aug 13 '11 at 07:13
@Gabriel good point, +1 for that observation -- you can do an implementation of *nix sort in Java; it is, after all, a classic external sorting process with a well-known implementation. If this is an interview question or homework, though, the askers of this question _do_ like it when you make the jump from do-it-yourself-programming to leveraging an environment. At least IMHO. :) — Ray Toal, Aug 13 '11 at 14:18
@Sam I added a complete Java application that tokenizes to the answer. Hope it works for you. (The `Pattern` class is from `java.util.regex`.) — Ray Toal, Aug 13 '11 at 15:49

score 3 · Answer 2 · answered Aug 13 '11 at 04:41

The fastest solution to this is O(n), AFAIK: use a loop to iterate over the string, get each character, and update its count in a HashMap. At the end, the HashMap contains every character that occurred and the count of its occurrences.

Some pseudocode (may not compile):

HashMap<Character, Integer> map = new HashMap<Character, Integer>();
for (int i = 0; i < str.length(); i++)
{
    char c = str.charAt(i);
    if (map.containsKey(c)) map.put(c, map.get(c) + 1);
    else map.put(c, 1);
}

Jesus Ramos

You want `map.put(c,1)` on your last line. – Ray Toal Aug 13 '11 at 04:45
@Ray, I realized that after posting and fixed it, as you can see. – Jesus Ramos Aug 13 '11 at 04:45
I see it now. Bad timing. Withdrawn. :) – Ray Toal Aug 13 '11 at 04:46
@Ray that's fine; if I hadn't noticed, that would have helped :) – Jesus Ramos Aug 13 '11 at 04:47

score 1 · Answer 3 · answered Aug 13 '11 at 04:45

It's hard for you to get much better than using a loop to solve this problem. IMO, the best way to speed up this sort of operation is to split the workload into different units of work and process the units of work with different processors (using threads, for example, if you have a multiprocessor computer).
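
As a rough illustration of splitting the work across processors without writing the threading by hand, here is a sketch using parallel streams. It assumes Java 8 or later, which postdates this answer, and the class name `ParallelWordCount` is made up. The runtime splits the array into chunks, and `groupingByConcurrent` lets all worker threads accumulate into one shared concurrent map.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelWordCount {
    /** Counts lower-cased, whitespace-separated words using the common fork-join pool. */
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.split("\\s+"))
                .parallel()                           // let the runtime split the work
                .filter(w -> !w.isEmpty())            // split() can yield a leading empty string
                .map(String::toLowerCase)
                .collect(Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("woo Woo hey woo").get("woo"));
    }
}
```

Whether this actually beats a plain loop depends on the input size and the machine, so it is worth measuring before adopting it.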

Peter Lawrey · Answer 4 · 2011-08-13T18:30:28.330

You shouldn't assume 900,000 is a lot of words. If you have a CPU with 8 threads at 3 GHz, that's 24 billion clock cycles per second. ;)

However, for counting characters, using an int[] will be much faster. There are only 65,536 possible characters.

StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
    words.append(Long.toString(rand.nextLong(), 36)).append(' ');
String text = words.toString();

long start = System.nanoTime();
int[] charCount = new int[Character.MAX_VALUE + 1]; // one slot per char value, 0 through 0xFFFF
for (int i = 0; i < text.length(); i++)
    charCount[text.charAt(i)]++;
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000/1000, text.length());

prints

Took 111 ms to count 139,715,647 characters

Even 11 times the number of words takes a fraction of a second.

A much longer parallel version is a little faster.

public static void main(String... args) throws InterruptedException, ExecutionException {
    StringBuilder words = new StringBuilder();
    Random rand = new Random();
    for (int i = 0; i < 10 * 1000 * 1000; i++)
        words.append(Long.toString(rand.nextLong(), 36)).append(' ');
    final String text = words.toString();

    long start = System.nanoTime();
    // start a thread pool to generate 4 tasks to count sections of the text.
    final int nThreads = 4;
    ExecutorService es = Executors.newFixedThreadPool(nThreads);
    List<Future<int[]>> results = new ArrayList<Future<int[]>>();
    int blockSize = (text.length() + nThreads - 1) / nThreads;
    for (int i = 0; i < nThreads; i++) {
        final int min = i * blockSize;
        final int max = Math.min(min + blockSize, text.length());
        results.add(es.submit(new Callable<int[]>() {
            @Override
            public int[] call() throws Exception {
                int[] charCount = new int[Character.MAX_VALUE + 1]; // one slot per char value, 0 through 0xFFFF
                for (int j = min; j < max; j++)
                    charCount[text.charAt(j)]++;
                return charCount;
            }
        }));
    }
    es.shutdown();
    // combine the results.
    int[] charCount = new int[Character.MAX_VALUE + 1];
    for (Future<int[]> resultFuture : results) {
        int[] result = resultFuture.get();
        for (int i = 0, resultLength = result.length; i < resultLength; i++) {
            charCount[i] += result[i];
        }
    }
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());
}

prints

Took 45 ms to count 139,715,537 characters

But for a String with fewer than a million words, it's not likely to be worth it.

The usual pedantic comment: There are not 65,536 characters. Unicode, the native character set of Java, has room for _over a million characters_ and currently more than **109,000** characters have been defined. I know you probably already know this, but I have this knee-jerk reaction to comment whenever I see the phrase "65,536 characters". See [UTF-16 IS HARMFUL](http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful) — Ray Toal, Aug 13 '11 at 15:59
Java supports code points, and if you need to count those you need a larger array, but the approach is the same. — Peter Lawrey, Aug 13 '11 at 18:31
Yes, keeping an array of UTF-16 codepoints for counting and resolving surrogates into real outside-of-BMP characters as a post-processing step is perfectly acceptable. (I should have added a smiley to the previous rant...) :) — Ray Toal, Aug 13 '11 at 18:59
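
For reference, the "larger array" variant mentioned in these comments might look like the sketch below. It counts code points directly instead of resolving surrogates afterwards, which is a different route from the post-processing step described above. It assumes Java 8 or later for `String.codePoints()`, and the class name `CodePointCount` is hypothetical.

```java
public class CodePointCount {
    /** Returns an array indexed by Unicode code point holding the number of occurrences. */
    public static int[] count(String text) {
        // One slot per code point, 0 through 0x10FFFF: about 1.1 million ints, roughly 4 MB.
        int[] counts = new int[Character.MAX_CODE_POINT + 1];
        // codePoints() joins each valid surrogate pair into a single code point.
        text.codePoints().forEach(cp -> counts[cp]++);
        return counts;
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00a"; // 'a', U+1F600 (one code point, two chars), 'a'
        int[] counts = count(text);
        System.out.println(counts['a'] + " " + counts[0x1F600]);
    }
}
```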

score 0 · Answer 5 · edited May 23 '17 at 12:26

As a general rule, you should just write things in a straightforward way, and then do performance tuning to make it as fast as possible. If that means putting in a faster algorithm, do so, but at first, keep it simple. For a small program like this, it won't be too hard.

The essential skill in performance tuning is not guessing. Instead, let the program itself tell you what to fix. This is my method.

For more involved programs, like this one, experience will show you how to avoid the over-thinking that ends up causing a lot of the poor performance it is trying to avoid.

score 0 · Answer 6 · answered Aug 13 '11 at 23:22

You have to use a divide-and-conquer approach and avoid contention for resources. There are different approaches and implementations for that, but the idea is the same: split the work and parallelize the processing.

On a single machine you can process chunks of the data in separate threads, although having the chunks on the same disk will slow things down considerably. Having more threads also means more context switching; for throughput it is IMHO better to have fewer of them and keep them busy.

You can split the processing into stages and use SEDA or something similar, and with really big data you go for map-reduce; just reckon with the expense of distributing the data across the cluster.

I'd be glad if somebody pointed to another widely used API.
