1

My program will read in a paragraph of words (stored in a text file). It will then need to do the following:

  • Print out a list of all the words (alphabetical order). For each word, print the frequency count (how many times the word appears in the entire paragraph) and the line numbers in which the word appears on (does not need to be ordered). If a word appears on a line multiple times, the line number does not need to be stored twice (the frequency count of this word will still be updated)
  • Display a list of words ordered from most frequent to least frequent.
  • The user will input a specific word. If the word is found, print out its frequency count.

Limitations: I cannot use the Collections class and I cannot store data multiple times. (e.g. Reading in words from the paragraph and storing them into a Set and an ArrayList)

Coding this won't be hard, but I can't figure out what would be the most efficient implementation since the data size could be a few paragraphs from a Wikipedia article or something. Here's my idea for now:

  • Have a Word class. This Word class will contain methods to return the word's frequency count and the lines in which the word appears on (and other relevant data).
  • The paragraph will be stored in a text file. The program will read the data line by line. Split the line into an array and read in words one by one.
  • As words are being read in from the text file, put the words into some sort of structure. If the structure does not contain the word, create a new word object.
  • If the structure already contains the word, update the frequency counter for that word.
    • I will also have a int to record down the line number. These line numbers will be updated accordingly.

This is somewhat incomplete, but it is what I'm thinking for now. The whole 'Word' class may probably be completely unnecessary, too.

Jin
  • 47
  • 2
  • 5
  • Make an ordered list where the insertion algorithm ensures those requirements. – ChiefTwoPencils May 14 '15 at 20:48
  • Then why not use a structure that automatically orders it for you? – Jin May 14 '15 at 20:50
  • Well, you could use a structure that orders it, but it's only going to insert it where it belongs. The other stuff you require like updating the line numbers and such won't be done. – ChiefTwoPencils May 14 '15 at 20:52
  • Perhaps refer to [this](http://stackoverflow.com/questions/663374/java-ordered-map). I'm still not sure how you'd handle the specific details you require *on* insertion. – ChiefTwoPencils May 14 '15 at 20:56
  • 1
    This will be difficult since you can't use any Collection classes. Your approach seems reasonable. Code it and see if it works. – Gilbert Le Blanc May 14 '15 at 20:57
  • Are you forbidden from using the `Collections` class, or the whole collections API? – Silly Freak May 14 '15 at 21:10
  • @SillyFreak Not entirely sure, my teacher just doesn't wanted us using stuff like `Collections.frequency()` – Jin May 14 '15 at 21:19
  • That gets right to the point of my original comment. A map by all means is a good way to associate the `Word`s with the string representation of the "word". But you can't just rely on a basic `put` because that simply replaces the value mapped to the key and thus fails to keep accurate data. So it seems you'll have to intervene on insertion. – ChiefTwoPencils May 14 '15 at 21:27

4 Answers4

2

You can use a simple TreeMap<String, Integer> for frequency lookups.

Lookups should be O(1), given that the words are short(i.e what you would find a normal text). If you expect lots of unsuccessful lookups (lots of searches for words that don't exist), you could prefilter using a Bloom Filter.

I'd start with a straightforward implementation, and optimize further if needed (parse the stream directly, instead of splitting each line with a separator and reiterating).

Razvan Manolescu
  • 414
  • 2
  • 10
  • 1
    This doesn't exactly solve the problem which is the reason choosing a structure is difficult. or so it seems. IOW, this simply keeps track of an ordered map of the words but only keeps track of frequency not paragraph lines and "other relevant data" which would need to happen *on* insertion requiring, as far as I can see, a custom `put` algo. – ChiefTwoPencils May 14 '15 at 21:17
2

First, you could create a class that holds the data for the occurrences and the row numbers (along with the word). This class could implement the Comparable interface, providing easy comparisons based on the word frequencies:

public class WordOccurrence implements Comparable<WordOccurrence> {

    private final String word;
    private int totalCount = 0;
    private Set<Integer> lineNumbers = new TreeSet<>();

    public WordOccurrence(String word, int firstLineNumber) {
        this.word = word;
        addOccurrence(firstLineNumber);
    }

    public final void addOccurrence(int lineNumber) {
        totalCount++;
        lineNumbers.add(lineNumber);
    }

    @Override
    public int compareTo(WordOccurrence o) {
        return totalCount - o.totalCount;
    }

    @Override
    public String toString() {
        StringBuilder lineNumberInfo = new StringBuilder("[");
        for (int line : lineNumbers) {
            if (lineNumberInfo.length() > 1) {
                lineNumberInfo.append(", ");
            }
            lineNumberInfo.append(line);
        }
        lineNumberInfo.append("]");
        return word + ", occurences: " + totalCount + ", on rows "
                + lineNumberInfo.toString();
    }
}

When reading the words from the file, it's useful to return the data in a Map<String, WordOccurrence>, mapping words into WordOccurrences. Using a TreeMap, you'll get alphabetical ordering "for free". Also, you may want to remove punctuation from the lines (e.g. using a regexp like \\p{P}) and ignore the case of the words:

public TreeMap<String, WordOccurrence> countOccurrences(String filePath)
        throws IOException {
    TreeMap<String, WordOccurrence> words = new TreeMap<>();

    File file = new File(filePath);
    BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(file)));
    String line = null;
    int lineNumber = 0;

    while ((line = reader.readLine()) != null) {
        // remove punctuation and normalize to lower-case
        line = line.replaceAll("\\p{P}", "").toLowerCase();
        lineNumber++;
        String[] tokens = line.split("\\s+");
        for (String token : tokens) {

            if (words.containsKey(token)) {
                words.get(token).addOccurrence(lineNumber);
            } else {
                words.put(token, new WordOccurrence(token, lineNumber));
            }
        }
    }

    return words;
}

Displaying the occurrences in alphabetical order using the above code is as simple as

for (Map.Entry<String, WordOccurrence> entry :
         countOccurrences("path/to/file").entrySet()) {
        System.out.println(entry.getValue());
}

If you cannot use Collections.sort() (and a Comparator<WordOccurrence>) for sorting by occurrences, you need to write the sorting yourself. Something like this should do it:

public static void displayInOrderOfOccurrence(
        Map<String, WordOccurrence> words) {

    List<WordOccurrence> orderedByOccurrence = new ArrayList<>();

    // sort
    for (Map.Entry<String, WordOccurrence> entry : words.entrySet()) {
        WordOccurrence wo = entry.getValue();

        // initialize the list on the first round
        if (orderedByOccurrence.isEmpty()) {
            orderedByOccurrence.add(wo);
        } else {

            for (int i = 0; i < orderedByOccurrence.size(); i++) {
                if (wo.compareTo(orderedByOccurrence.get(i)) > 0) {
                    orderedByOccurrence.add(i, wo);
                    break;
                } else if (i == orderedByOccurrence.size() - 1) {
                    orderedByOccurrence.add(wo);
                    break;
                }
            }
        }
    }

    // display
    for (WordOccurrence wo : orderedByOccurence) {
        System.out.println(wo);
    }
}

Running the above code using the following test data:

Potato; orange.
Banana; apple, apple; potato.
Potato.

will produce this output:

apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]
potato, occurrences: 3, on rows [1, 2, 3]

potato, occurrences: 3, on rows [1, 2, 3]
apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]
Mick Mnemonic
  • 7,808
  • 2
  • 26
  • 30
0

you can have a structure like this one : https://gist.github.com/jeorfevre/946ede55ad93cc811cf8

/**
* 
* @author Jean-Emmanuel je@Rizze.com
*
*/
public class WordsIndex{
        HashMap<String, Word> words = new HashMap<String, Word>();

        public static void put(String word, int line, int paragraph){
            word=word.toLowerCase();

            if(words.containsKey(word)){
                Word w=words.get(word);
                w.count++;

            }else{
                //new word
                Word w = new Word();
                w.count=1;
                w.line=line;
                w.paragraph=paragraph;
                w.word=word;
                words.put(word, w);
            }



        }
    }

    public class Word{
        String word;
        int count;
        int line;
        int paragraph;
    }

enjoy

jeorfevre
  • 2,286
  • 1
  • 17
  • 27
  • @J-E Answers that contain not much more than a link to an external resource are discouraged on StackOverflow. Include the code in your answer (indented with 4 spaces so that it will be formatted propertly) instead of pointing to an external website. – Jesper May 14 '15 at 21:03
  • Thanks for tips @jesper, indead the external link is gist that I've produced now in order to solve this answer. – jeorfevre May 14 '15 at 21:04
  • Thanks for leaving the link and authorship for citation. – ChiefTwoPencils May 14 '15 at 21:05
  • 1
    @Chief Two pencils : yep (but it's a code I made now) – jeorfevre May 14 '15 at 21:07
  • @J-E There are some problems with your solution, for example `return this;` will not work in a `static` method. – Jesper May 14 '15 at 21:07
  • My bad...I actually added it back. Feel free to roll it back. – ChiefTwoPencils May 14 '15 at 21:07
  • Still a problem, the member variable `words` is not accessible from the `static` method `put`. You can fix it by removing `static`. – Jesper May 14 '15 at 21:09
  • If you want to I can write also order() in order to order and parser. Let me know. – jeorfevre May 14 '15 at 21:09
0

you can use TreeMap it is very suitable for getting the data ordered. use your word as key and the frequency as value. for example let the following is you paragraph

Java is good language Java is object oriented so I will do the following in order to store each word and its frequency

String s = "Java is good language Java is object oriented"   ; 
String strArr [] = s.split(" ") ; 
TreeMap<String, Integer> tm = new TreeMap<String, Integer>();
for(String str : strArr){
   if(tm.get(str) == null){
         tm.put(str, 1) ; 
   }else{
        int count = tm.get(str) ; 
        count+=1 ; 

   }
}

hopefully this will help you

Alaa Abuzaghleh
  • 1,023
  • 6
  • 11