For a homework assignment, we have to turn the basicCompare method into something that compares two text documents and decides whether they're about similar topics. Basically, the program strips out all of the words less than five characters in length, which leaves us with two lists of words. We're supposed to compare the lists so that, if enough of the words are shared between the two documents (let's say 80% similarity), the method returns true and says they "match."

However, I got stuck right about where all the comments are at the bottom of the method. I can't think of or find a way to compare the two lists and find out what percentage of the words are in both lists. Maybe I'm thinking about it wrong, and need to filter out words that aren't in both lists and then just count how many words are left. The parameters for defining whether or not the input documents match are left entirely up to us, so those can be set however I want. If you kind ladies and gentlemen could just point me in the right direction, even to a Java doc page on a certain function, I'm sure I can get the rest of the way. I just need to know where to start.

import java.util.Collections;
import java.util.List;

public class MyComparator implements DocumentComparator {

    public static void main(String[] args) {
            MyComparator mc = new MyComparator();

            if (mc.basicCompare("C:\\Users\\Quinncuatro\\Desktop\\MatchLabJava\\LabCode\\match1.txt", "C:\\Users\\Quinncuatro\\Desktop\\MatchLabJava\\LabCode\\match2.txt")) {
                    System.out.println("match1.txt and match2.txt are similar!");
            } else {
                    System.out.println("match1.txt and match2.txt are NOT similar!");
            }
    }

    //In the basicCompare method, since the bottom returns false, it results in the else statement in the calling above, saying they're not similar
    //Need to implement a thing that if so many of the words are shared, it returns as true

    public boolean basicCompare(String f1, String f2) {
            List<String> wordsFromFirstArticle = LabUtils.getWordsFromFile(f1);
            List<String> wordsFromSecondArticle = LabUtils.getWordsFromFile(f2);

            Collections.sort(wordsFromFirstArticle);
            Collections.sort(wordsFromSecondArticle);//sort list alphabetically

            for(String word : wordsFromFirstArticle){
                    System.out.println(word);
            }

            for(String word2 : wordsFromSecondArticle){
                    System.out.println(word2);
            }

            //Find a way to use common_words to strip out the "noise" in the two lists, so you're ONLY left with unique words
            //Get rid of words not in both lists, if above a certain number, return true
            //If word1 = word2 more than 80%, return true

            //Then just write more whatever.basicCompare modules to compare 2 to 3, 1 to 3, 1 to no, 2 to no, and 3 to no

            //Once you get it working, you don't need to print the words, just say whether or not they "match"

            return false;

    }


    public boolean mapCompare(String f1, String f2) {

            return false;
    }

}

Gilles 'SO- stop being evil'
  • http://stackoverflow.com/questions/2762093/java-compare-two-lists – Brian Roach Apr 03 '12 at 01:07
  • although you showed the code, it is best if you can show your effort on the *main meat* of the problem, rather than just provide the scaffolding code around it. – cctan Apr 03 '12 at 01:15

2 Answers

Try to come up with an algorithm by performing the steps on paper, or in your head. Once you understand what you need to do, translate that into code. This is how all algorithms are invented.

Sky Kelsey
  • I understand that, Sky Kelsey, this is just a matter of learning Java and not knowing if what I want to do is translatable into clean code. – Henry Edward Quinn IV Apr 03 '12 at 01:05
  • A couple of things: 1) You are storing Strings in a List, which allows duplicates. There are other data structures. Is it possible that one of them does not allow duplicates and would make your job easier? 2) There is an interface, java.util.Collection, that List and other "Collections" implement. Look through the common methods declared in that interface. One is called "contains", and it may help you as well. – Sky Kelsey Apr 03 '12 at 01:09
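To make the two hints in that comment concrete, here is a minimal sketch. The class name CollectionSketch, the helper names, and the sample words are made up for illustration; only java.util classes are used. It shows that a HashSet drops the duplicates a List keeps, and that contains is available on both because it is declared on java.util.Collection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

class CollectionSketch {

    // Number of different words in the list: copying into a HashSet drops duplicates.
    static int distinctCount(List<String> words) {
        return new HashSet<>(words).size();
    }

    // Accepts a List or a Set, since contains is declared on the Collection interface.
    static boolean hasWord(Collection<String> words, String word) {
        return words.contains(word);
    }

    public static void main(String[] args) {
        // Hypothetical sample data; "planet" appears twice on purpose.
        List<String> words = new ArrayList<>(Arrays.asList("planet", "orbit", "planet"));

        System.out.println(words.size());                         // list keeps the duplicate
        System.out.println(distinctCount(words));                 // set does not
        System.out.println(hasWord(words, "orbit"));              // contains on a List
        System.out.println(hasWord(new HashSet<>(words), "moon")); // contains on a Set
    }
}
```

On a HashSet, contains is typically a constant-time hash lookup, whereas on an ArrayList it scans the elements one by one.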

Start off by changing your Lists to Sets to remove duplicates.

Iterate over one of the sets and use the contains method to check to see if the other contains the same words.

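import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Inside basicCompare(String f1, String f2):
int count = 0; // number of words found in both documents

// Copying each word list into a HashSet drops duplicate words
Set<String> set1 = new HashSet<String>(LabUtils.getWordsFromFile(f1));
Set<String> set2 = new HashSet<String>(LabUtils.getWordsFromFile(f2));

Iterator<String> it = set1.iterator();

// Walk the first set and count each word the second set also contains
while (it.hasNext()) {
    String s = it.next();

    if (set2.contains(s)) {
        count++;
    }
}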

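Then use the counter to calculate the percentage, (count / total) * 100, where total could be, for example, the number of words in the set you iterated over (set1.size()). Be careful: if count and total are both ints, count / total is integer division in Java and truncates to 0 whenever count is less than total, so write it as 100.0 * count / total instead. If the result is greater than 80, return true; otherwise return false.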
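The percentage step can be sketched as below. This is only one way to do it, under a few assumptions: the class and method names (OverlapSketch, overlapPercent, isSimilar) are made up, the word lists stand in for whatever LabUtils.getWordsFromFile returns, the total is taken to be the number of distinct words in the first document, and the threshold check uses >= so that exactly 80% counts as a match. It uses Set.retainAll to get the shared words instead of a manual loop.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OverlapSketch {

    // Percentage (0 to 100) of the distinct words in `first` that also occur in `second`.
    static double overlapPercent(List<String> first, List<String> second) {
        Set<String> set1 = new HashSet<>(first);
        Set<String> set2 = new HashSet<>(second);
        if (set1.isEmpty()) {
            return 0.0; // assumption: an empty document matches nothing
        }
        Set<String> shared = new HashSet<>(set1);
        shared.retainAll(set2); // keep only the words present in both sets
        // 100.0 makes this floating-point arithmetic rather than integer division.
        return 100.0 * shared.size() / set1.size();
    }

    // True when the overlap reaches the given threshold (in percent).
    static boolean isSimilar(List<String> first, List<String> second, double thresholdPercent) {
        return overlapPercent(first, second) >= thresholdPercent;
    }

    public static void main(String[] args) {
        // Hypothetical word lists; 4 of the 5 words in `a` also appear in `b`.
        List<String> a = Arrays.asList("apple", "banana", "cherry", "grape", "lemon");
        List<String> b = Arrays.asList("apple", "banana", "cherry", "grape", "mango");

        System.out.println(overlapPercent(a, b));
        System.out.println(isSimilar(a, b, 80.0));
    }
}
```

Note that this measure is not symmetric: comparing a short document against a long one can give a different percentage than the reverse. Dividing by the size of the smaller set, or by the size of the union of both sets, are alternative choices, and the assignment leaves that decision open.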

It's always good to understand the difference between lists, sets, and queues. I hope this points you in the right direction.

Barnesy