More efficient or more modern? Reading in & Sorting A Text File With Java

Question

I've been trying to upgrade my Java skills to use more of Java 5 & Java 6. I've been playing around with some programming exercises. I was asked to read in a paragraph from a text file and output a sorted (descending) list of words and output the count of each word.

My code is below.

My questions are:

Is my file input routine the most respectful of JVM resources?
Is it possible to cut steps out in regards to reading the file contents and getting the content into a collection that can make a sorted list of words?
Am I using the Collection classes and interface the most efficient way I can?

Thanks much for any opinions. I'm just trying to have some fun and improve my programming skills.

import java.io.*;
import  java.util.*;

public class Sort
{
    public static void main(String[] args)
    {
        String   sUnsorted       = null;
        String[] saSplit         = null;

        int iCurrentWordCount    = 1;
        String currentword       = null;
        String pastword          = "";

        // Read the text file into a string
        sUnsorted = readIn("input1.txt");

        // Parse the String by white space into String array of single words
        saSplit   = sUnsorted.split("\\s+");

        // Sort the String array in descending order
        java.util.Arrays.sort(saSplit, Collections.reverseOrder());


        // Count the occurences of each word in the String array
        for (int i = 0; i < saSplit.length; i++ )
        {

            currentword = saSplit[i];

            // If this word was seen before, increase the count & print the
            // word to stdout
            if ( currentword.equals(pastword) )
            {
                iCurrentWordCount ++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // Reset our counter
            else if (!currentword.equals(pastword))
            {

                if ( !pastword.equals("") )
                {

                    System.out.println("Word Count for " + pastword + ": " + iCurrentWordCount);

                }


                System.out.println(currentword );
                iCurrentWordCount = 1;

            }

            pastword = currentword;  
        }// end for loop

       // Print out the count for the last word processed
       System.out.println("Word Count for " + currentword + ": " + iCurrentWordCount);



    }// end funciton main()


    // Read The Input File Into A String      
    public static String readIn(String infile)
    {
        String result = " ";

        try
        {
            FileInputStream file = new FileInputStream (infile);
            DataInputStream in   = new DataInputStream (file);
            byte[] b             = new byte[ in.available() ];

            in.readFully (b);
            in.close ();

            result = new String (b, 0, b.length, "US-ASCII");

        }
        catch ( Exception e )
        {
            e.printStackTrace();
        }

        return result;
    }// end funciton readIn()

}// end class Sort()

/////////////////////////////////////////////////
//  Updated Copy 1, Based On The Useful Comments
//////////////////////////////////////////////////

import java.io.*;
import java.util.*;

public class Sort2
{
    public static void main(String[] args) throws Exception
    {
        // Scanner will tokenize on white space, like we need
        Scanner scanner               = new Scanner(new FileInputStream("input1.txt"));
        ArrayList <String> wordlist   = new  ArrayList<String>();
        String currentword            = null;   
        String pastword               = null;
        int iCurrentWordCount         = 1;       

        while (scanner.hasNext())
            wordlist.add(scanner.next() );

        // Sort in descending natural order
        Collections.sort(wordlist);
        Collections.reverse(wordlist);

        for ( String temp : wordlist )
        {
            currentword = temp;

            // If this word was seen before, increase the count & print the
            // word to stdout
            if ( currentword.equals(pastword) )
            {
                iCurrentWordCount ++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // Reset our counter
            else //if (!currentword.equals(pastword))
            {
                if ( pastword != null )
                    System.out.println("Count for " + pastword + ": " +  
                                                            CurrentWordCount);   

                System.out.println(currentword );
                iCurrentWordCount = 1;    
            }

            pastword = currentword;  
        }// end for loop

        System.out.println("Count for " + currentword + ": " + iCurrentWordCount);

    }// end funciton main()


}// end class Sort2

The first thing that stands out is your C++ background. You may get more out of the exercises if you attempt to make your solutions object-oriented, even if the questions do not specifically ask for it. Making it more object-oriented will get you thinking about how to group functionality into logical classes and hide implementation details behind more-convenient method calls. That said, time to read more of your code and address your question more directly... — Mike M, Jun 07 '11 at 16:46
your naming conventions are atrocious for modern Java. Hungarian notation that isn't even consistent isn't idiomatic to Java of any version! Directly using `Array` is frowned upon as well, there are `List` and `Set` classes that are more idiomatic as well. — , Jun 07 '11 at 16:47
Jarrod. I understand the comment about Hungarian notation. Why are the List or Set classes better than using an Array in this situation? — Steve, Jun 07 '11 at 17:10
Mike M; Your comment interests me. Can you provide an example of how the code could be more OO? I'm having trouble seeing it...which is of course, why I wrote it like I did. — Steve, Jun 07 '11 at 17:12
Your code is broken [avalaible()](http://download.oracle.com/javase/6/docs/api/java/io/InputStream.html#available%28%29) returns an estimate of the number of bytes that can be read without blocking. That could be a lot smaller than the file size. — Ishtar, Jun 07 '11 at 17:15
Also, you should declare variables at the last possible moment (ie, when needed) instead of putting them all at the top of the method. — Steve Kuo, Jun 07 '11 at 17:24
Just minor things that aren't truly necessary for solving the problem, but get you thinking in more of a Java mindset. You could have a class dedicated to reading an entire file into a String and a class that counts the occurrences of each word in a String. Again, completely overkill for this exercise, but it's more of the Java mindset you're looking for here, right? — Mike M, Jun 07 '11 at 17:26
@Steve Kuo - the one exception to that rule, for me, is the return value of a method. It makes debugging much easier if you declare the return variable as soon as you enter the method. — Mike M, Jun 07 '11 at 17:27
How does declaring variable immediately make debugging easier? — Steve Kuo, Jun 08 '11 at 04:42

score 4 · Accepted Answer · 2011-06-07T19:37:15.100

There are more idiomatic ways of reading in all the words in a file in Java. BreakIterator is a better way of reading in words from an input.
Use List<String> instead of Array in almost all cases. Array isn't technically part of the Collection API and isn't as easy to replace implementations as List, Set and Map are.
You should use a Map<String,AtomicInteger> to do your word counting instead of walking the Array over and over. AtomicInteger is mutable unlike Integer so you can just incrementAndGet() in a single operation that just happens to be thread safe. A SortedMap implementation would give you the words in order with their counts as well.
Make as many variables, even local ones final as possible. and declare them right before you use them, not at the top where their intended scope will get lost.
You should almost always use a BufferedReader or BufferedStream with an appropriate buffer size equal to a multiple of your disk block size when doing disk IO.

That said, don't concern yourself with micro optimizations until you have "correct" behavior.

score 2 · Answer 2 · edited May 23 '17 at 11:48

2

the SortedMap type might be efficient enough memory-wise to use here in the form SortedMap<String,Integer> (especially if the word counts are likely to be under 128)
you can provide customer delimiters to the Scanner type for breaking streams

Depending on how you want to treat the data, you might also want to strip punctuation or go for more advanced word isolation with a break iterator - see the java.text package or the ICU project.

Also - I recommend declaring variables when you first assign them and stop assigning unwanted null values.

To elaborate, you can count words in a map like this:

void increment(Map<String, Integer> wordCountMap, String word) {
  Integer count = wordCountMap.get(word);
  wordCountMap.put(word, count == null ? 1 : ++count);
}

Due to the immutability of Integer and the behaviour of autoboxing, this might result in excessive object instantiation for large data sets. An alternative would be (as others suggest) to use a mutable int wrapper (of which AtomicInteger is a form.)

edited May 23 '17 at 11:48

Community

1
1

answered Jun 07 '11 at 16:52

McDowell

107,573
31
204
267

+1 for OrderedMap. I was thinking a regular old HashMap, but OrderedMap would make things even easier. – Mike M Jun 07 '11 at 17:17
Hi McDowell; Using Scanner sounds like a neat idea. Maps are for storing key value pairs and I only want to get a list of single, non-paired items. Are you suggesting I use a Map for its API and just make the key & value the same String? – Steve Jun 07 '11 at 17:18
@user787832 - you can use the map to store the words (keys) and the word counts (values). – McDowell Jun 07 '11 at 17:37
2

There's a typo here. Should be SortedMap (although the link goes to the correct API page). I agree with the suggestion in another answer to use AtomicInteger for the value. You don't really need the concurrency, but AtomicIntegers are mutable (and have increment/decrement methods), whereas Integers are immutable, and you can't use primitive ints for the keys/values of a Map. – Rick Goldstein Jun 07 '11 at 17:50
To expand on what McDowell is saying: you would essentially be tallying a running total of each word and then printing all of the information at the end, rather than printing along the way. The benefit of this is that your code is more compartmentalized, so you can do whatever you want (printing or otherwise) with the totals once you have them. – Mike M Jun 07 '11 at 17:51
Hi Mike M & McDowell. At the end of my code at the top of the post I appended an updated version using the useful comments here. Using the Scanner class greatly reduced the size of the code. I don't understand why I would want to use a map. The Scanner class loop gives me unordered tokens/words so I can't count the occurence of each particular word while in the Scanner loop. I have to store the tokens, then sort and then count the words. A map will only let me have 1 copy of each key ( the word ) so that messes up counting. I can also use Collections.sort. Comments? Thanks. – Steve Jun 07 '11 at 20:09
1

@user787832 - If you use SortedMap, you only need to do one pass through the words. As you read in the next word, check to see if it's in the Map already (by calling get(word)). If it's there, you'll get an AtomicInteger, on which you can call the increment function to increase the count. If not, create a new AtomicInteger with the value 1 and add it to the Map keyed by the word. At the end, you can just walk through the entry set of the Map (which will be in order, because it's a SortedMap), and just read off the values, which give you the counts. – Rick Goldstein Jun 07 '11 at 21:39

score 0 · Answer 3 · answered Jun 07 '11 at 16:54

0

Can you use Guava for your homework assignment? Multiset handles the counting. Specifically, LinkedHashMultiset might be useful.

answered Jun 07 '11 at 16:54

djg

1,283
9
6

Hi djg; Believe it or not, this not homework. It is just me trying to spruce myself up by googling on "code kata". I was not aware of Guava. Thanks. I'm trying to stick to standard Java for the moment. – Steve Jun 07 '11 at 17:14

score 0 · Answer 4 · answered Jun 07 '11 at 16:55

Some other things you might find interesting:

To read the file you could use a BufferedReader (if it's text only).

This:

for (int i = 0; i < saSplit.length; i++ ){
    currentword = saSplit[i];
    [...]
}

Could be done using a extended for-loop (the Java-foreach), like shown here.

if ( currentword.equals(pastword) ){
    [...]
} else if (!currentword.equals(pastword)) {
    [...]
}

In your case, you can simply use a single else so the condition isn't checked again (because if the words aren't the same, they can only be different).

if ( !pastword.equals("") )

I think using length is faster here:

if (!pastword.length == 0)

For that last point, if you _are_ going to use .equals(), you should use the constant first -- i.e., `if ( "".equals( pastword ) )` -- to avoid possible `NullPointerException`s. — Mike M, Jun 07 '11 at 17:19

Mike M · Answer 5 · 2011-06-07T17:46:06.120

Input method:

Make it easier on yourself and deal directly with characters instead of bytes. For example, you could use a FileReader and possibly wrap it inside a BufferedReader. At the least, I'd suggest looking at InputStreamReader, as the implementation to change from bytes to characters is already done for you. My preference would be using Scanner.

I would prefer returning null or throwing an exception from your readIn() method. Exceptions should not be used for flow control, but, here, you're sending an important message back to the caller: the file that you provided was not valid. Which brings me to another point: consider whether you truly want to catch all exceptions, or just ones of certain types. You'll have to handle all checked exceptions, but you may want to handle them differently.

Collections:

You're really not use Collections classes, you're using an array. Your implementation seems fine, but...

There are certainly many ways of handling this problem. Your method -- sorting then comparing to last -- is O(nlogn) on average. That's certainly not bad. Look at a way of using a Map implementation (such as HashMap) to store the data you need while only traversing the text in O(n) (HashMap's get() and put() -- and presumably contains() -- methods are O(1)).

Hmm... when I wrote that I didn't notice that your output was to be in sorted order. Unfortunately, you're not going to get that below O(nlogn), but I still think using a `Map` implementation is going to be better. — Mike M, Jun 07 '11 at 17:17

More efficient or more modern? Reading in & Sorting A Text File With Java

5 Answers5