56

Description | A Java program to read a text file and print each of the unique words in alphabetical order together with the number of times the word occurs in the text.

The program should declare a variable of type Map<String, Integer> to store the words and corresponding frequency of occurrence. Which concrete type, though? TreeMap<String, Number> or HashMap<String, Number> ?

The input should be converted to lower case.

A word does not contain any of these characters: \t\n\f.,!?:;\"()'

Example output |

 Word            Frequency
  a                 1
  and               5
  appearances       1
  as                1
         .
         .
         .

Remark | I know, I've seen elegant solutions to this in Perl with roughly two lines of code. However, I want to see it in Java.

Edit: Oh yeah, it'd be helpful to show an implementation using one of these structures (in Java).

Quinn Taylor
  • 44,553
  • 16
  • 113
  • 131
JohnZaj
  • 3,080
  • 5
  • 37
  • 51
  • Wow, is this blatantly fishing for the answer to a homework question? – Spencer Kormos Nov 19 '08 at 16:37
  • 26
    Actually, the homework question was to actually implement the two versions--it asked nothing about which data structure is better. However, in effort to maintain the lost art of semi-sincere learning, I am beating around the bush...just trying to glean some insight here! – JohnZaj Nov 19 '08 at 17:03
  • How the heck did I get six points for that comment above. I know...question for meta.. – JohnZaj Aug 15 '12 at 03:33
  • I have similar problem, everything the same, I only want to sort by values, not keys... what would be best approach? – ante.sabo Oct 01 '13 at 19:57

14 Answers

63

TreeMap seems a no-brainer to me - simply because of the "in alphabetical order" requirement. HashMap has no ordering when you iterate through it; TreeMap iterates in the natural key order.

EDIT: I think Konrad's comment may have been suggesting "use HashMap, then sort." This is good because although we'll have N iterations initially, we'll have K <= N keys by the end due to duplicates. We might as well save the expensive bit (sorting) until the end, when we've got fewer keys, rather than take the small-but-non-constant hit of keeping the map sorted as we go.

Having said that, I'm sticking to my answer for the moment: because it's the simplest way of achieving the goal. We don't really know that the OP is particularly worried about performance, but the question implies that he's concerned about the elegance and brevity. Using a TreeMap makes this incredibly brief, which appeals to me. I suspect that if performance is really an issue, there may be a better way of attacking it than either TreeMap or HashMap :)
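To illustrate how brief the TreeMap route is, here is a minimal sketch (the split regex is an assumption based on the question's character list, and the file is taken from the first command-line argument):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCount {
    // Build an alphabetically ordered word -> frequency map from lines of text.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            // lower-case the input and split on the excluded characters
            for (String word : line.toLowerCase().split("[ \t\n\f.,!?:;\"()']+")) {
                if (!word.isEmpty()) {
                    Integer c = counts.get(word);
                    counts.put(word, c == null ? 1 : c + 1);
                }
            }
        }
        return counts; // a TreeMap iterates in key order, so this is already alphabetical
    }

    public static void main(String[] args) throws Exception {
        for (Map.Entry<String, Integer> e : count(Files.readAllLines(Paths.get(args[0]))).entrySet()) {
            System.out.printf("%-15s %d%n", e.getKey(), e.getValue());
        }
    }
}
```

No explicit sort appears anywhere; the ordering falls out of the choice of map.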

rahul sharma
  • 505
  • 4
  • 17
Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • 7
    Jon, don't overestimate sorting. In my experience, sorting even gigantic data sets can be neglected. O(1) insertion beats O(logn) hands down, given enough items. But as always, this is pure speculation without profiling. – Konrad Rudolph Nov 19 '08 at 18:48
  • 1
    If you can achieve O(1) insertion, then you've got O(n) total insertion time. If you're also suggesting there's no more work involved before you iterate through, you've somehow achieved O(n) sorting of n (relatively arbitrary) items. How does that work? – Jon Skeet Nov 19 '08 at 22:33
  • Konrad: I think I now see your point. Editing... – Jon Skeet Nov 20 '08 at 06:18
  • 1
    Jon, I recognize this is ancient history ;P But here's the explanation. Hashtables actually *do* implement sorting in expected, amortized O(n) time. The proven lower bound of n log n on sort-related operations applies to the comparison model of computation. The hashtable defeats this by taking advantage of the sequential nature of random-access memory in a conventional computer. However! In practise, comparison-based sorting is almost always faster; anecdotally, a good hashtable implementation is a constant factor of 100 more expensive than a mediocre quicksort implementation. – Mark McKenna Aug 11 '11 at 14:24
  • 1
    @Mark: I'm not quite sure what you're explaining or how the sequential nature of memory can avoid N log N complexity... – Jon Skeet Aug 11 '11 at 14:28
  • Here's an edge-on answer to your second question: http://en.wikipedia.org/wiki/Radix_sort – Mark McKenna Aug 11 '11 at 14:42
  • @Mark: But surely we can't do a radix sort with strings, as we don't have integer keys. We have hash codes, but comparing those doesn't help us compare keys. – Jon Skeet Aug 11 '11 at 14:44
  • Er, let me add to that: Briefly, if I can compute a memory offset in constant time from each value to be sorted in such a way as to guarantee that if X < Y, then addr(X) < addr(Y), then I can sort a list by scanning it, writing each value into its value-derived memory address, then doing a linear scan of the output memory region. Total complexity is O(n+m), where n is the element count and m is the size of the output memory region. A hashtable is essentially a more involved version of the above that makes m smaller. – Mark McKenna Aug 11 '11 at 14:49
  • @Mark: Nope, still not following it. How are you going to compute that memory offset? `hashCode()` certainly doesn't give you anything like that. Given that for any two strings I can construct another string between the two of them, I can't see how you can compute a memory offset... – Jon Skeet Aug 11 '11 at 14:51
  • Radix sort on ASCII strings uses a radix of 256. If the strings are language-bound you can do it using the language character domain, which is usually better. Radix sort on unicode strings would have a radix of 64k, which is memory-prohibitive but still theoretically O(n), which is all I'm actually saying. – Mark McKenna Aug 11 '11 at 14:53
  • @JonSkeet let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/2375/discussion-between-mark-mckenna-and-jon-skeet) – Mark McKenna Aug 11 '11 at 14:53
  • @Mark: But the strings are still an arbitrary length, which stumps things I think. From the Wikipedia article you referenced: "Radix sort's efficiency is O(k·n) for n keys which have k or fewer digits." That's pretty significant unless you limit `k`, which we haven't here. I don't think you can just ignore the value of `k` in this case. – Jon Skeet Aug 11 '11 at 14:55
  • Let me recap. All I'm saying is that you sort in O(n) using the nature of the random access model--not that it can be done in all cases, nor that it is a good idea. To touch on the last point, k is in practise usually quite limited in problems like these (the average word length in the "average" English document is about 5 characters; so k would be about 5 on average, making this O(n) in the expected case). Also, to touch on the hashtable argument, whether your memory addresses are sort-stable depends on your choice of hashing function. hashCode() isn't suitable, but others are. – Mark McKenna Aug 11 '11 at 15:01
  • @JonSkeet How do we sort HashMap? You mean, something like this? new TreeSet(myMap.keySet()) – Jenix Jul 17 '18 at 13:43
  • @Jenix: I'm not sure what you're asking, to be honest. You don't sort a HashMap itself. – Jon Skeet Jul 17 '18 at 14:08
  • That's why I'm asking.. You said 'EDIT: I think Konrad's comment may have been suggesting "use HashMap, then sort." – Jenix Jul 17 '18 at 14:10
  • @Jenix: Right, But "use a hash map, then sort" isn't the same as "sorting a hash map". But yes, that would be using a HashMap then creating a TreeMap from it - but as I say, it's not clear that it's worth it. – Jon Skeet Jul 17 '18 at 14:49
  • Yeah that's what I'm thinking but not just one, but a few more people said so on StackOverFlow, so I had to ask you. But anyway, if TreeMap is what everyone here is saying about 'use a hash map and then sort it', then I think, as I said above, creating a new TreeSet (not TreeMap) and then iterrating through the original HashMap using keys from that TreeSet is better. Because we can use less memory and also still can access faster than TreeMap. – Jenix Jul 17 '18 at 15:56
  • @Jenix: I'd be surprised if there was much difference to be honest. I wouldn't want to guess without measuring. – Jon Skeet Jul 17 '18 at 19:47
18

TreeMap beats HashMap because TreeMap is already sorted for you.

However, you might want to consider using a more appropriate data structure, a bag. See Commons Collections - and the TreeBag class:

This has a nice optimised internal structure and API:

bag.add("big");
bag.add("small");
bag.add("big");
int count = bag.getCount("big");

EDIT: The question of HashMap vs TreeMap performance was answered by Jon - HashMap and sort may be quicker (try it!), but TreeMap is easier. The same tradeoff applies to bags: there is a HashBag as well as a TreeBag. Because the implementation uses a mutable integer internally, a bag should outperform the equivalent plain map of Integer. The only way to know for sure is to test, as with any performance question.

JodaStephen
  • 60,927
  • 15
  • 95
  • 117
  • 2
    Yes, but would it be faster to sort once at the end instead of incurring the cost of keeping things sorted the whole time? – Herms Nov 19 '08 at 17:19
  • Herms, it would depend on the efficiency of the datastructures involved. In-place sorting an array using random-pivot quicksort is a *FAAAST* n log n, but Java's TreeMap is an also pretty fast n log n. Java's HashMap works at a pretty slow n. Since we need random access to the elements we are choosing between a fast O(n log t) once, and a slow O(n) once plus a really fast O(t log t); where t < n. The latter would eventually surpass the former as n got large but I think n would have to get *very* large first. For Google, hash+sort would be better; for school, probably treemap. – Mark McKenna Aug 11 '11 at 14:33
11

I see quite a few people saying "TreeMap look-up takes O(n log n)"!! How come?

I don't know how it has been implemented but in my head it takes O(log n).

This is because look-up in a tree can be done in O(log n). You don't sort the entire tree every time you insert an item in it. That's the whole idea of using a tree!

Hence, going back to the original question, the figures for comparison turn out to be:

HashMap approach: O(n + k log k) average case, worst case could be much larger

TreeMap approach: O(k + n log k) worst case

where n = number of words in the text , k = number of distinct words in the text.
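These figures can be sanity-checked empirically. A naive micro-benchmark sketch (n and k are arbitrary; a serious comparison would need proper warm-up and repetition):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class MapTiming {
    public static void main(String[] args) {
        int n = 1_000_000, k = 10_000; // n words drawn from k distinct values
        Random rnd = new Random(42);
        List<String> words = new ArrayList<String>(n);
        for (int i = 0; i < n; i++) {
            words.add("word" + rnd.nextInt(k));
        }

        // HashMap approach: O(n) counting plus an O(k log k) sort at the end
        long t0 = System.nanoTime();
        Map<String, Integer> hash = new HashMap<String, Integer>();
        for (String w : words) {
            Integer c = hash.get(w);
            hash.put(w, c == null ? 1 : c + 1);
        }
        List<String> keys = new ArrayList<String>(hash.keySet());
        Collections.sort(keys);
        long hashTime = System.nanoTime() - t0;

        // TreeMap approach: n insertions, each costing O(log k)
        long t1 = System.nanoTime();
        Map<String, Integer> tree = new TreeMap<String, Integer>();
        for (String w : words) {
            Integer c = tree.get(w);
            tree.put(w, c == null ? 1 : c + 1);
        }
        long treeTime = System.nanoTime() - t1;

        System.out.printf("HashMap+sort: %d ms, TreeMap: %d ms%n",
                hashTime / 1_000_000, treeTime / 1_000_000);
    }
}
```

The absolute numbers are meaningless on their own, but the ratio gives a feel for the constant factors hidden by the big-O expressions above.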

Mallikarjuna Reddy
  • 1,212
  • 2
  • 20
  • 33
saurabh
  • 111
  • 1
  • 3
3
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class TreeMapExample {

    public static void main(String[] args) {
        Map<String, Integer> tm = new TreeMap<String, Integer>();
        try {
            BufferedReader br = new BufferedReader(new FileReader("Test.txt"));
            String line;
            while ((line = br.readLine()) != null) {
                // strip punctuation before tokenizing
                line = line.replaceAll("[-+.^:;,()\"\\[\\]]", "");
                StringTokenizer st = new StringTokenizer(line, " ");
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();
                    Integer count = tm.get(word);
                    tm.put(word, count == null ? 1 : count + 1);
                }
            }
            br.close();
            for (Map.Entry<String, Integer> entry : tm.entrySet()) {
                System.out.println(entry.getKey() + " : " + entry.getValue());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Balu
  • 31
  • 2
3

A hash map should be much faster. You should not choose a container based on how you want the items arranged eventually; just sort the list of (word, frequency) pairs at the end. There will usually be fewer such pairs to sort than words in the file, so asymptotic (and real-world) performance with a hash map will be better.
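A sketch of that approach: count into a HashMap, then sort only the distinct (word, frequency) pairs at the end (the sample text here is arbitrary):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashThenSort {
    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog the end";

        // Phase 1: O(n) counting with a HashMap
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String w : text.split("\\s+")) {
            Integer c = counts.get(w);
            counts.put(w, c == null ? 1 : c + 1);
        }

        // Phase 2: sort only the k distinct pairs, not all n words
        List<Map.Entry<String, Integer>> pairs =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(pairs, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return a.getKey().compareTo(b.getKey());
            }
        });

        for (Map.Entry<String, Integer> e : pairs) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```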

  • Not necessarily. In the worst case, hash tables have O(n) access time. Whereas self-balancing binary search trees are always O(log n) even in worst case. – newacct Apr 28 '09 at 17:53
  • 2
    A hashtable really only experiences worst-case performance with poor loading factors and/or a poor hashing function. Good enough hashes and good enough factors are available by default in Java, so this isn't a realistic worry. – Mark McKenna Aug 11 '11 at 14:35
2

You can't assign a TreeMap<String,Number> to a variable with the type Map<String,Integer>. Double, Long, etc. can be "put" into a TreeMap<String,Number>. When I "get" a value from a Map<String,Integer>, it must be an Integer.
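A quick sketch of why the declared types must line up (the commented-out lines are the ones the compiler rejects):

```java
import java.util.Map;
import java.util.TreeMap;

class GenericsMismatch {
    public static void main(String[] args) {
        // Map<String, Integer> bad = new TreeMap<String, Number>(); // does not compile
        Map<String, Number> m = new TreeMap<String, Number>();
        m.put("pi", 3.14);          // a Double is fine in a Map<String, Number>
        // Integer i = m.get("pi"); // does not compile: get() returns Number
        Number n = m.get("pi");
        System.out.println(n);
    }
}
```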

Completely ignoring any i18n issues, memory constraints, and error handling, here goes:

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Counter {

  public static void main(String... argv)
    throws Exception
  {
    FileChannel fc = new FileInputStream(argv[0]).getChannel();
    ByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
    CharBuffer cb = Charset.defaultCharset().decode(bb);
    Pattern p = Pattern.compile("[^ \t\r\n\f.,!?:;\"()']+");
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    Matcher m = p.matcher(cb);
    while (m.find()) {
      String word = m.group();
      Integer count = counts.get(word);
      count = (count == null) ? 1 : count + 1;
      counts.put(word, count);
    }
    fc.close();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.printf("%s: %d%n", e.getKey(), e.getValue());
    }
  }

}
erickson
  • 265,237
  • 58
  • 395
  • 493
  • Yes, you're right--I'll go back and edit that. What I meant to say is declare it as type Map<String, Number> instead of Map<String, Integer>; then TreeMap<String, Number> is a subtype of Map<String, Number>. – JohnZaj Nov 19 '08 at 19:12
2

"When a key already exists it has the same performance as a HashMap." - That is just plain wrong. HashMap has O(1) insertion and TreeMap O(log n). It'll take up to log n comparisons just to find out whether the key is already in the tree!

coderz
  • 21
  • 1
2

For this task, in my opinion, it's better to use HashBag from Apache Commons Collections, HashMultiset from Guava, HashBag from Eclipse Collections (formerly GS Collections), or any of the following classes:

    Order    |  Guava           |   Apache  | Eclipse(GS) | JDK analog
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Not defined  | HashMultiset     |   HashBag | HashBag     | HashMap<String, Integer>
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Sorted       | TreeMultiset     |   TreeBag | TreeBag     | TreeMap<String, Integer>
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Linked       |LinkedHashMultiset|     -     |     -       | LinkedHashMap<String, Integer>
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Concurrent & | ConcurrentHash-  |Synchroniz-|Synchroniz-  | Collections.synchronizedMap(
not defined  | Multiset         |   edBag   | edBag       |       HashMap<String, Integer>)
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Concurrent   |         -        |Synchroniz-|Synchroniz-  | Collections.synchronizedSortedMap(
and sorted   |                  |edSortedBag| edSortedBag |       TreeMap<String, Integer>)
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Immutable and| ImmutableMultiset|Unmodifiab-|Unmodifiab-  | Collections.unmodifiableMap(
not defined  |                  |   leBag   | leBag       |       HashMap<String, Integer>)
─────────────┼──────────────────┼───────────┼─────────────┼─────────────
Immutable and| ImmutableSorted- |Unmodifiab-|Unmodifiab-  | Collections.unmodifiableSortedMap(
sorted       | Multiset         |leSortedBag| leSortedBag |       TreeMap<String, Integer>)
────────────────────────────────────────────────────────────────────────

Examples:

1. Using SynchronizedSortedBag from Apache:

    // Parse text to separate words
    String INPUT_TEXT = "Hello World! Hello All! Hi World!";
    // Create Multiset
    Bag bag = SynchronizedSortedBag.synchronizedBag(new TreeBag(Arrays.asList(INPUT_TEXT.split(" "))));

    // Print count words
    System.out.println(bag); // print [1:All!,2:Hello,1:Hi,2:World!]- in natural (alphabet) order
    // Print all unique words
    System.out.println(bag.uniqueSet());    // print [All!, Hello, Hi, World!]- in natural (alphabet) order


    // Print count occurrences of words
    System.out.println("Hello = " + bag.getCount("Hello"));    // print 2
    System.out.println("World = " + bag.getCount("World!"));    // print 2
    System.out.println("All = " + bag.getCount("All!"));    // print 1
    System.out.println("Hi = " + bag.getCount("Hi"));    // print 1
    System.out.println("Empty = " + bag.getCount("Empty"));    // print 0

    // Print count all words
    System.out.println(bag.size());    //print 6

    // Print count unique words
    System.out.println(bag.uniqueSet().size());    //print 4

2. Using TreeBag from Eclipse(GC):

    // Parse text to separate words
    String INPUT_TEXT = "Hello World! Hello All! Hi World!";
    // Create Multiset
    MutableSortedBag<String> bag =  TreeBag.newBag(Arrays.asList(INPUT_TEXT.split(" ")));

    // Print count words
    System.out.println(bag); // print [All!, Hello, Hello, Hi, World!, World!]- in natural order
    // Print all unique words
    System.out.println(bag.toSortedSet());    // print [All!, Hello, Hi, World!]- in natural order

    // Print count occurrences of words
    System.out.println("Hello = " + bag.occurrencesOf("Hello"));    // print 2
    System.out.println("World = " + bag.occurrencesOf("World!"));    // print 2
    System.out.println("All = " + bag.occurrencesOf("All!"));    // print 1
    System.out.println("Hi = " + bag.occurrencesOf("Hi"));    // print 1
    System.out.println("Empty = " + bag.occurrencesOf("Empty"));    // print 0

    // Print count all words
    System.out.println(bag.size());    //print 6

    // Print count unique words
    System.out.println(bag.toSet().size());    //print 4

3. Using LinkedHashMultiset from Guava:

    // Parse text to separate words
    String INPUT_TEXT = "Hello World! Hello All! Hi World!";
    // Create Multiset
    Multiset<String> multiset = LinkedHashMultiset.create(Arrays.asList(INPUT_TEXT.split(" ")));

    // Print count words
    System.out.println(multiset); // print [Hello x 2, World! x 2, All!, Hi]- in predictable iteration order
    // Print all unique words
    System.out.println(multiset.elementSet());    // print [Hello, World!, All!, Hi] - in predictable iteration order

    // Print count occurrences of words
    System.out.println("Hello = " + multiset.count("Hello"));    // print 2
    System.out.println("World = " + multiset.count("World!"));    // print 2
    System.out.println("All = " + multiset.count("All!"));    // print 1
    System.out.println("Hi = " + multiset.count("Hi"));    // print 1
    System.out.println("Empty = " + multiset.count("Empty"));    // print 0

    // Print count all words
    System.out.println(multiset.size());    //print 6

    // Print count unique words
    System.out.println(multiset.elementSet().size());    //print 4

You can find more examples in my GitHub projects.

Slava Vedenin
  • 58,326
  • 13
  • 40
  • 59
1

I would definitely choose a TreeMap:

  • TreeMap automatically sorts new keys on insertion, no sorting afterwards is needed.
  • When a key already exists it has the same performance as a HashMap.

A TreeSet internally uses a TreeMap, so why not use TreeMap directly?

0

Depending on what the speed requirements are, you could also use a Trie. But there's no point in implementing one of those if a TreeMap is quick enough.
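If you did want to try it, a minimal trie-based counter might look like this (a sketch restricted to lower-case ASCII words; the class and method names are made up for illustration). A depth-first walk over the children in index order visits the words alphabetically for free:

```java
class TrieCounter {
    static class Node {
        Node[] child = new Node[26]; // one slot per lower-case letter
        int count;                   // occurrences of the word ending at this node
    }

    private final Node root = new Node();

    void add(String word) {
        Node n = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (n.child[c] == null) n.child[c] = new Node();
            n = n.child[c];
        }
        n.count++;
    }

    int count(String word) {
        Node n = root;
        for (int i = 0; i < word.length() && n != null; i++) {
            n = n.child[word.charAt(i) - 'a'];
        }
        return n == null ? 0 : n.count;
    }

    // Depth-first walk: children are visited in letter order, so output is alphabetical
    void print(Node n, StringBuilder prefix) {
        if (n.count > 0) System.out.println(prefix + " " + n.count);
        for (int c = 0; c < 26; c++) {
            if (n.child[c] != null) {
                prefix.append((char) ('a' + c));
                print(n.child[c], prefix);
                prefix.deleteCharAt(prefix.length() - 1);
            }
        }
    }

    public static void main(String[] args) {
        TrieCounter t = new TrieCounter();
        for (String w : "the cat and the hat".split(" ")) t.add(w);
        t.print(t.root, new StringBuilder());
    }
}
```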

CAdaker
  • 14,385
  • 3
  • 30
  • 32
0

Here is a Java example that reads a text file, sorts the words by key, and then sorts them by value, i.e. by the number of occurrences of each word in the file.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

public class SortFileWords {

    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        ValueCompare vc = new ValueCompare(map);
        TreeMap<String, Integer> sorted_map = new TreeMap<String, Integer>(map);
        List<String> list = new ArrayList<>();
        Scanner sc;
        try {
            sc = new Scanner(new File("c:\\ReadMe1.txt"));
            while (sc.hasNext()) {
                list.add(sc.next());
            }
            sc.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        for (String s : list) {
            if (map.containsKey(s)) {
                map.put(s, map.get(s) + 1);
            } else
                map.put(s, 1);
        }

        System.out.println("Unsorted map: " + map);
        sorted_map.putAll(map);
        System.out.println("Sorted map on keys: " + sorted_map);

        TreeMap<String, Integer> sorted_value_map = new TreeMap<>(vc);
        sorted_value_map.putAll(map);
        System.out.println("Sorted map on values: " + sorted_value_map);
    }
}

class ValueCompare implements Comparator<String> {

    Map<String, Integer> map;

    public ValueCompare(Map<String, Integer> map) {
        this.map = map;
    }

    @Override
    public int compare(String s1, String s2) {
        int c = map.get(s2).compareTo(map.get(s1)); // descending by frequency
        return c != 0 ? c : s1.compareTo(s2); // tie-break on the word so the comparator is consistent with equals
    }
}
tynn
  • 38,113
  • 8
  • 108
  • 143
hardeep thakur
  • 311
  • 3
  • 9
0

Consider the frequency of additions to and deletions from the data structure; TreeMap would not be ideal if it is high. Apart from the O(log n) search for an existing entry, it also undergoes frequent rebalancing.

On the other hand, hash structures are a bit flamboyant on memory (they over-allocate). If you can bite that bullet, then go for a hash structure and sort when required.

G Kumar
  • 131
  • 1
  • 2
-2

Why not use TreeSet?

Same ordering concept as a TreeMap, except it's a Set - which, by definition, is "A collection that contains no duplicate elements".

From your problem description, it sounds as if you need a Set; I don't see what keys and values you are mapping together.

This class implements the Set interface, backed by a TreeMap instance. This class guarantees that the sorted set will be in ascending element order, sorted according to the natural order of the elements (see Comparable), or by the comparator provided at set creation time, depending on which constructor is used.

matt b
  • 138,234
  • 66
  • 282
  • 345
-3

Basically it depends on the requirement. Sometimes a hash map is good, sometimes a treemap; but a hash map is the better choice only if you can accept the overhead of sorting it afterwards.