0

So I have a .txt file which I am calling using

String[] data = loadStrings("data/data.txt");

The file is already sorted and essentially looks like:

Animal
Animal
Cat
Cat
Cat
Dog

I am looking to create an algorithm to count the sorted list in java, without using any libraries like Multisets or without the use of Maps/HashMaps. I have managed so far to get it print out the top occurring word like so:

ArrayList<String> words = new ArrayList();

int[] occurrence = new int[2000];
Arrays.sort(data);

for (int i = 0; i < data.length; i ++ ) {
words.add(data[i]);     //Put each word into the words ArrayList
}
for(int i =0; i<data.length; i++) {
 occurrence[i] =0;
 for(int j=i+1; j<data.length; j++) {
   if(data[i].equals(data[j])) {
     occurrence[i] = occurrence[i]+1;
   }
 }
}
int max = 0;
String most_talked ="";
for(int i =0;i<data.length;i++) {
  if(occurrence[i]>max) {
    max = occurrence[i];
    most_talked = data[i];
  }
 }
 println("The most talked keyword is " + most_talked + " occuring " + max + " times.");

I want rather than just to get the highest occurring word perhaps the top 5 or top 10. Hope that was clear enough. Thanks for reading

Nebbyyy
  • 358
  • 4
  • 20
  • Just create a new arraylist where you store an object for every word and the number of occurences. Then sort it descending and print the first ten out. – cansik Mar 28 '15 at 23:25

3 Answers3

1

If you cannot use Guava's Multiset, then you can implement an equivalent yourself. Basically, you just need to create a Map<String, Integer>, which keeps track of counts (value) per each word (key). This means changing this

ArrayList<String> words = new ArrayList<String>();
// ...
for (int i = 0; i < data.length; i ++ ) {
  words.add(data[i]);     //Put each word into the words ArrayList
}

into this:

Map<String, Integer> words = new HashMap<String>();
// ...
for (String word : data) {
  Integer count = words.get(word);
  words.put(word, (count != null : count.intValue() + 1 ? 1));
}

After you've filled the map, just sort it by the values.

If you cannot use a Map either, you can do the following:

First, create a wrapper class for your word counts:

public class WordCount implements Comparable<WordCount> {
    private String word;
    private int count;

    public WordCount(String w, int c) {
      this.word = w;
      this.count = c;
    }

    public String getWord() {
      return word;
    }

    public int getCount() {
      return count;
    }

    public void incrementCount() {
      count++;
    }                 

    @Override
    public int compareTo(WordCount other) {
      return this.count - other.count;
    }
}

Then, change your code to store WordCount instances in your list (instead of Strings):

ArrayList<WordCount> words = new ArrayList<WordCount>();
// ...
for (String word : data) {
    WordCount wc = new WordCount(word, 1);
    boolean wordFound = false;

    for (WordCount existing : words) {
        if (existing.getWord().equals(wc.getWord())) {
            existing.incrementCount();
            wordFound = true;
            break;
        }
    }

    if (!wordFound) {
        words.add(wc);
    }
}

Finally, after populating the List, simply sort it using Collections.sort(). This is easy because the value objects implement Comparable:

Collections.sort(words, Collections.reverseOrder());
Community
  • 1
  • 1
Mick Mnemonic
  • 7,808
  • 2
  • 26
  • 30
  • 1
    As I said I cannot use Maps nor can I use Multiset :/ – Nebbyyy Mar 28 '15 at 23:33
  • You said you cannot use _any library_ (I assume that means external library). Map is a part of Java API. – Mick Mnemonic Mar 28 '15 at 23:42
  • Sorry edited it now, Is there any way to do this via the sort / compareTo() functions, I was just told to try and do it creating your own algorithm and iterating over it – Nebbyyy Mar 28 '15 at 23:45
  • I removed the `equals()` and `hashCode()` implementations from `WordCount` because they weren't needed in this example. – Mick Mnemonic Mar 29 '15 at 00:52
1

Since you said you dont want to use some kind of data structure i think that you can do something like this, but it is not performant. I usually prefer to store index rather than values.

ArrayList<String> words = new ArrayList();

int[] occurrence = new int[2000];
Arrays.sort(data);


int nwords = 0;
occurrence[nwords]=1;
words.add(data[0]);        
for (int i = 1; i < data.length; i ++ ) {
    if(!data[i].equals(data[i-1])){ //if a new word is found
        words.add(data[i]);         //put it into the words ArrayList
        nwords++;                   //increment the index
        occurrence[nwords]=0;       //initialize its occurrence counter
    }
    occurrence[nwords]++;           //increment the occurrence counter
}

int max;
for(int k=0; k<5; k++){  //loop to find 5 times the most talked word
  max = 0;               //index of the most talked word
  for(int i = 1; i<words.size(); i++) { //for every word
    if(occurrence[i]>occurrence[max]) { //if it is more talked than max
      max = i;                          //than it is the new most talked
    }
  }
  println("The most talked keyword is " + words.get(max) + " occuring " + occurence[max] + " times.");
  occurence[max]=0;
}

Every time I find the value with the higher occurence value, i set his occurrence counter to 0 and I reiterate again the array, this for 5 times.

Devid Farinelli
  • 7,514
  • 9
  • 42
  • 73
  • EDIT returns The most talked about keyword is occurring 5 times. – Nebbyyy Mar 28 '15 at 23:58
  • Ok so now outputs 5 which is great but just counts up on the same word output is: The most talked keyword is #youth occuring 60 times. The most talked keyword is #youth occuring 59 times. The most talked keyword is #youth occuring 58 times. The most talked keyword is #youth occuring 57 times. The most talked keyword is #youth occuring 56 times. – Nebbyyy Mar 29 '15 at 00:05
  • What does it prints if you edit your print like this? println("The most talked keyword is " + data[max] + " occuring " + occurence[max] + " times, having index "+ max ); – Devid Farinelli Mar 29 '15 at 00:08
  • The most talked keyword is #youth occuring 60 times, having index 227 The most talked keyword is #youth occuring 59 times, having index 228 The most talked keyword is #youth occuring 58 times, having index 229 The most talked keyword is #youth occuring 57 times, having index 230 The most talked keyword is #youth occuring 56 times, having index 231 – Nebbyyy Mar 29 '15 at 00:09
  • Sadly not :( getting returned : The most talked keyword is occuring 1 times. Index is 0 The most talked keyword is occuring 1 times. Index is 1 The most talked keyword is occuring 1 times. Index is 2 The most talked keyword is occuring 1 times. Index is 3 The most talked keyword is occuring 1 times. Index is 4 – Nebbyyy Mar 29 '15 at 00:25
  • Edited. I'm also looking if there are other problems – Devid Farinelli Mar 29 '15 at 00:30
  • could you comment on the code as to your thought process as you created it if it is not too much trouble to ask – Nebbyyy Mar 29 '15 at 00:34
  • Edited, let me know if you need something else. Thanks a lot for the upvote ;) – Devid Farinelli Mar 29 '15 at 00:41
0

You could try something simple like this..

int count = 0;

for( int i = 0; i < words.size(); i++ ){
    System.out.printf("%s: ", words.get( i ));
    for( int j = 0; j < words.size(); j++ ) {
        if( words.get( i ).equals( words.get( j ) ) )
            count++;
    }                                               
    System.out.printf( "%d\n", count );
}
Michael Queue
  • 1,340
  • 3
  • 23
  • 38