public class TestArticles {

    public static void handlewords() throws IOException {

        String path = "C:\\Features.txt";
        String path1 = "C:\\train.txt";
        String path2 = "C:\\test.txt";

        File file = new File(path2);
        PrintWriter pw = new PrintWriter(file);

        Features ft = new Features();
        String content = ft.readFile(path);
        String[] words = content.split(" ");

        FileReader fr = new FileReader(path1);
        BufferedReader br = new BufferedReader(fr);
        String line = null;
        while ((line = br.readLine()) != null) {
            String[] word = line.split(" ");

            List<String> list1 = new ArrayList<String>(words.length);
            List<String> list2 = new ArrayList<String>(word.length);

            for (String s : words) {
                list1.add(s);
                HashSet set = new HashSet(list1);
                list1.clear();
                list1.addAll(set);
            }

            for (String x : word) {
                list2.add(x);
                HashSet set = new HashSet(list2);
                list2.clear();
                list2.addAll(set);
            }

            boolean first = true;
            pw.append("{");
            for (String x : list1) {
                for (String y : list2) {
                    if (x.equalsIgnoreCase(y)) {
                        if (first) {
                            first = false;
                        } else {
                            pw.append(",");
                        }
                        pw.append(list1.indexOf(x) + 39 + " " + "1");
                    }
                }
            }
            pw.append("}");
            pw.append("\r\n");
            pw.flush();
        }
        br.close();
        pw.close();
    }
}

My output file looks something like:

  1. {23 1,35 1,56 1,56 1,...}
  2. {2 1,4 1,7 1,...}

The first line contains some duplicated data, while the second line has all the data in order without duplicates. How can I delete the duplicated data? I already used a HashSet, but it did not work.

Mike

2 Answers


The items in your list1 and list2 are correctly unique, but only in a case-sensitive way. So you might have items in them like man and Man. But then in your last loop you use x.equalsIgnoreCase(y), and since "man".equalsIgnoreCase("man") and "man".equalsIgnoreCase("MAn") are both true, that's how duplicates appear in the output.

There are several ways to fix that:

  • When you build list1 and list2, lowercase the items
  • Or, use a TreeSet instead of HashSet, with a comparator that ignores case
  • Change x.equalsIgnoreCase(y) to x.equals(y)
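For example, the second option could look like this (a minimal, self-contained sketch; the word list is made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CaseInsensitiveDedup {
    public static void main(String[] args) {
        // "man" and "Man" are distinct to a HashSet, which is why duplicates
        // survive and then match each other later via equalsIgnoreCase.
        List<String> words = Arrays.asList("man", "Man", "MAN", "woman");

        // A TreeSet with a case-insensitive comparator treats "man", "Man"
        // and "MAN" as the same element, so only the first one is kept.
        Set<String> unique = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        unique.addAll(words);

        System.out.println(unique); // [man, woman]
    }
}
```

With the lists deduplicated case-insensitively up front, the last loop can safely use plain equals.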
janos
  • Janos, it still has the duplicated data. – Mike Apr 11 '14 at 19:32
  • Janos, I tried to use the TreeSet, it also has duplicated data. – Mike Apr 11 '14 at 19:45
  • The input has a large data set, the first input is wordlist like: 1 word hello... and displayed line by line. And another input has couple articles, like "I am a man........" "I am a woman......" – Mike Apr 11 '14 at 20:00
  • You can see my code, I first put data in list, then put the list into hashset, at last clear the arraylist and put the set into arraylist again. – Mike Apr 11 '14 at 20:07
  • Janos, you are awesome. It works, I used the TreeSet instead of HashSet, and equals instead of equalsIgnoreCase. Thank you so much. – Mike Apr 11 '14 at 20:19

Try overriding equals on your HashSet, like this:

HashSet set = new HashSet(list1) {
    @Override
    public boolean equals(Object o) {
        return this.toString().equals(o.toString());
    }
};
DanW
  • Dan, I want to find a way to remove the duplicated data automatically instead of looking up that string. Do you have any idea? – Mike Apr 11 '14 at 19:34
  • didn't understand your comment.. you want to take off duplicated data, right? What didn't work as expected? – DanW Apr 11 '14 at 19:51
  • Yes, I want to remove the duplicated data. – Mike Apr 11 '14 at 19:57