
I have a large file which contains various strings. I need to parse the file, find the count of each word present in it, and then arrange the words in ascending order of their counts.

My approach was to parse the file and store the words in a HashMap, where the word is the key and its count is the value. The count is updated as we parse the file. Once parsing is complete, I will sort the collection based on the counts.

The above approach is pretty simple, but it doesn't take into account that the file is large.

What changes should I make to my approach to handle a large file?

Lokesh
  • Threads don't change the runtime complexity, but reduce the constant. – Makoto Mar 13 '13 at 15:22
  • is this a real world or homework problem ? – Oren Mar 13 '13 at 15:22
  • It's not a real-world problem, but how does that make a difference? – Lokesh Mar 13 '13 at 15:24
  • @Makoto: Agreed. My task is to find an efficient and fast way to achieve this, so threads will definitely speed things up. – Lokesh Mar 13 '13 at 15:25
  • What I'm getting at is that you'll be able to find an efficient solution without worrying about multithreading for the time being. Be wary of premature optimization: the first rule is don't do it; the second rule (for experts only) is don't do it yet. – Makoto Mar 13 '13 at 15:26
  • Reading the file will be the (speed) bottleneck. The implementation with only one thread will be the fastest. If you have enough memory to hold a `HashMap`, that would be the data structure of my choice (check for overflow in case an `int` is not big enough to hold your counter). – MrSmith42 Mar 13 '13 at 15:33
  • @Lokesh In a real-world problem, optimizations that don't reduce the theoretical complexity but improve runtime in practice are usually acceptable; in homework, however.. – Oren Mar 13 '13 at 15:34
  • @MrSmith: I didn't get the point of how a single thread will be fastest. If I split the file into 10 parts and read it with 10 threads, won't it be faster? Also, the problem with a HashMap is how to sort it based on values. – Lokesh Mar 13 '13 at 15:38
  • @Lokesh: 1. Splitting the file will cost time. 2. As long as all parts of the file are on the same physical drive, parallel reading will not reduce the total time to read the data. (It might even increase it, because hard drives, for example, are fastest when data is read sequentially.) – MrSmith42 Mar 13 '13 at 17:24

4 Answers


Don't use a HashMap if you are going to have multiple threads; use a ConcurrentHashMap instead (javadoc).

You will still have to perform some sort of check when updating the Integer value if it's already there. See this post for more details on that process.
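If Java 8 or later is available, `ConcurrentHashMap.merge` performs that check-and-update atomically in a single call; a minimal sketch (the word array is illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentCount {
    public static void main(String[] args) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        String[] words = {"one", "two", "two", "three", "three", "three"};
        for (String w : words) {
            // merge is atomic on a ConcurrentHashMap: it inserts 1 if the
            // word is absent, otherwise adds 1 to the existing count
            counts.merge(w, 1, Integer::sum);
        }
        System.out.println(counts); // e.g. {one=1, two=2, three=3}
    }
}
```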

See this post for sorting the map after you have it populated.

nattyddubbs

So, to give you a bit more clarification to my statement in the comments:

Let's presume you have the large file. It takes N operations to read through it all in word-for-word fashion. This by far will be your bottleneck, since I/O is generally slow.

For your counting scheme, you use a Map<String, Integer>. Every word you see, you place into the Map, and if you encounter a particular word more than once, you increment its count. In general, adding a particular key-value pair is constant time (HashMap), and checking whether a word is already in the map is constant as well.

So your overall runtime performance to count words in a file would be O(N) + C, where N is mostly due to I/O.

Now, let's say you use ten threads. You cut the large file up into ten chunks and let each thread insert their values into the ConcurrentHashMap. Your overall runtime complexity hasn't changed, except it has (potentially) been reduced by a factor of 10.

Your runtime with t threads will be O(N/t) + C, which still reduces to O(N) + C.

The only way you could get this to be more efficient is if you could change the linear scanning method employed to be more efficient than linear time.

Makoto

First I'd use a Map to determine the word count:

    String[] words = {"one", "two", "three", "two", "three", "three"};
    Map<String, Integer> map = new HashMap<String, Integer>();
    for (String word : words) {
        int count = 0;
        if (map.containsKey(word)) {
            count = map.get(word);
        }
        map.put(word, ++count);
    }
    System.out.println(map);
    --> output: {two=2, one=1, three=3}

Then, I'd use either a TreeMap or a new "custom" key/value class to sort by count:

Using TreeMap:

    private static void sortUsingTreeMap(Map<String, Integer> map) {
        TreeMap<String, Integer> sorted = new TreeMap<String, Integer>(new MyComparator(map));
        sorted.putAll(map);
        System.out.println(sorted);
    }

    static class MyComparator implements Comparator<String> {
        private final Map<String, Integer> map;

        MyComparator(Map<String, Integer> map) {
            this.map = map;
        }

        @Override
        public int compare(String o1, String o2) {
            // Compare by count first, then by the word itself as a tie-breaker;
            // without the tie-breaker, a TreeMap would treat words with equal
            // counts as duplicate keys and silently drop all but one of them.
            int byCount = map.get(o1).compareTo(map.get(o2));
            return byCount != 0 ? byCount : o1.compareTo(o2);
        }
    }
    --> output: {one=1, two=2, three=3}

Using a new key/value class:

    private static void sortUsingKeyValueClass(Map<String, Integer> map) {
        class KeyValue implements Comparable<KeyValue> {
            private final Integer count;
            private final String word;

            public KeyValue(Integer count, String word) {
                this.count = count;
                this.word = word;
            }

            @Override
            public int compareTo(KeyValue o) {
                return count.compareTo(o.count);
            }

            @Override
            public String toString() {
                return word + "=" + count;
            }
        }

        List<KeyValue> keyValues = new ArrayList<KeyValue>();
        for (String word : map.keySet()) {
            keyValues.add(new KeyValue(map.get(word), word));
        }
        Collections.sort(keyValues);
        System.out.println(keyValues);
    }
    --> output: [one=1, two=2, three=3]

I would also add that I'd defer adding threads to the mix until I found it necessary performance-wise. As others here have stated, a poor implementation won't be saved by processing the results concurrently.

Sean Landsman

As was said in the comments, threads are only useful as a tiebreaker, when you want your solution to be just a little bit faster than someone else's. Threads are useless if what is running inside them is really slow.

A HashMap will give you the best time complexity for the first part of your question.

For the second part of your question, I would use a set, an array of word lists indexed by count, and the HashMap you built in the first part. If you parse the file a second time, adding each new word to the set so it is only placed once, you can look up its count in the HashMap and store the word at the index equal to that count. After that, just traverse the array from the lowest index upward and you will have the words in ascending order of their counts.
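The bucket idea described above is essentially a counting sort over the counts; a minimal sketch (the sample counts map and maxCount are illustrative — in practice maxCount would be the largest value observed while parsing):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortByCount {
    public static void main(String[] args) {
        // counts would come from the first parse of the file;
        // this small map is just illustrative
        Map<String, Integer> counts = new HashMap<>();
        counts.put("one", 1);
        counts.put("two", 2);
        counts.put("three", 3);

        int maxCount = 3; // highest count observed while parsing

        // buckets.get(i) holds every word that occurred exactly i times
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= maxCount; i++) {
            buckets.add(new ArrayList<String>());
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            buckets.get(e.getValue()).add(e.getKey());
        }

        // walking the buckets from low to high yields ascending order of count
        for (int i = 0; i <= maxCount; i++) {
            for (String word : buckets.get(i)) {
                System.out.println(word + "=" + i);
            }
        }
    }
}
```

This runs in O(W + maxCount) time for W distinct words, avoiding the O(W log W) comparison sort entirely.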

eliot