
My program opens a file and saves its words together with their byte offset from the beginning of the file. However, the file contains many duplicate words that I don't want, and I also want my list in alphabetical order. The problem is that when I fix the order, the duplicates get messed up, and vice versa. Here is my code:

import java.io.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

class MyMain {
        public static void main(String[] args) throws IOException {
            ArrayList<DictPage> listOfWords = new ArrayList<DictPage>(); 
            LinkedList<Page> Eurethrio = new LinkedList<Page>(); 
            File file = new File("C:\\Kennedy.txt");
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
            //This will reference one line at a time...
            String line = null;
            int line_count=0;
            int byte_count; 
            int total_byte_count=0; 
            int fromIndex;

            int kat = 0;
            while( (line = br.readLine())!= null ){
                line_count++;
                fromIndex=0;
                String [] tokens = line.split(",\\s+|\\s*\\\"\\s*|\\s+|\\.\\s*|\\s*\\:\\s*");
                String line_rest=line;
                for (int i=1; i <= tokens.length; i++) {
                    byte_count = line_rest.indexOf(tokens[i-1]);
                    //if ( tokens[i-1].length() != 0)
                    //System.out.println("\n(line:" + line_count + ", word:" + i + ", start_byte:" + (total_byte_count + fromIndex) + "' word_length:" + tokens[i-1].length() + ") = " + tokens[i-1]);
                    fromIndex = fromIndex + byte_count + 1 + tokens[i-1].length();
                    if (fromIndex < line.length())
                        line_rest = line.substring(fromIndex);
                    if (!listOfWords.contains(tokens[i-1])) { // don't store the same word twice
                        //listOfWords.add(tokens[i-1]);
                        listOfWords.add(new DictPage(tokens[i-1],kat));
                        kat++;
                    }

                    Eurethrio.add(new Page("Kennedy", fromIndex));
                }
                total_byte_count += fromIndex;
                Eurethrio.add(new Page("Kennedy", total_byte_count));
            }

            Set<DictPage> hs = new HashSet<DictPage>();
            hs.addAll(listOfWords);
            listOfWords.clear();
            listOfWords.addAll(hs);

            if (listOfWords.size() > 0) {
                Collections.sort(listOfWords, new Comparator<DictPage>() {
                    @Override
                    public int compare(final DictPage object1, final DictPage object2) {
                        return object1.getWord().compareTo(object2.getWord());
                    }
                   } );
               }
            // Print the words...
            for (int i = 0; i<listOfWords.size();i++){
                System.out.println(""+listOfWords.get(i).getWord()+" "+listOfWords.get(i).getPage());
            }
            for (int i = 0;i<Eurethrio.size();i++){
                System.out.println(""+Eurethrio.get(i).getFile()+" "+Eurethrio.get(i).getBytes());
            }
        }
}
  • Possible duplicate of [How do I remove repeated elements from ArrayList?](http://stackoverflow.com/questions/203984/how-do-i-remove-repeated-elements-from-arraylist) – Oli Mar 22 '16 at 13:28
  • Either first sort the list and then construct the set (you could use a `LinkedHashSet` to preserve order) or use a `TreeSet` as already has been suggested. – Thomas Mar 22 '16 at 13:31
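The sort-then-dedupe route from the last comment can be sketched with plain strings; the same idea applies to `DictPage` by sorting with a word comparator first:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;

class SortThenDedupe {
    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList("the", "ask", "not", "ask", "what"));
        // 1) Sort alphabetically first...
        words.sort(Comparator.naturalOrder());
        // 2) ...then a LinkedHashSet drops duplicates while preserving
        //    the (now sorted) insertion order.
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(words));
        System.out.println(unique); // [ask, not, the, what]
    }
}
```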

3 Answers


Use a TreeSet instead of an ArrayList, and you'll automatically get ordering with no repeats.

Gregory Prescott
  • I get this error when I use TreeSet: DictPage cannot be cast to java.lang.Comparable –  Mar 22 '16 at 13:50
  • Sure. You should create your own Comparator for DictPage. Look here for an example: http://stackoverflow.com/questions/2748829/create-a-sortedmap-in-java-with-a-custom-comparator – Gregory Prescott Mar 22 '16 at 13:52
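The comparator suggestion from the comment above can be sketched as follows; the nested `DictPage` here is a minimal stand-in for the question's (unshown) class, assumed to have `getWord()` and `getPage()`:

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

class TreeSetDemo {
    // Minimal stand-in for the question's DictPage (assumed shape).
    static class DictPage {
        private final String word;
        private final int page;
        DictPage(String word, int page) { this.word = word; this.page = page; }
        String getWord() { return word; }
        int getPage() { return page; }
    }

    public static void main(String[] args) {
        // A TreeSet built with a word comparator keeps entries sorted and
        // silently rejects a second DictPage with the same word.
        Set<DictPage> listOfWords = new TreeSet<>(Comparator.comparing(DictPage::getWord));
        listOfWords.add(new DictPage("not", 0));
        listOfWords.add(new DictPage("ask", 1));
        listOfWords.add(new DictPage("ask", 2)); // duplicate word: ignored
        for (DictPage d : listOfWords) {
            System.out.println(d.getWord() + " " + d.getPage());
        }
        // Prints:
        // ask 1
        // not 0
    }
}
```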

Use this:

public void stripDuplicatesFromFile(String filename) {
    try {
        BufferedReader reader = new BufferedReader(new FileReader(filename));
        Set<String> lines = new HashSet<String>();
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        reader.close();
        BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
        for (String unique : lines) {
            writer.write(unique);
            writer.newLine();
        }
        writer.close();
    } catch (IOException e) { // also covers FileNotFoundException
        e.printStackTrace();
    }
}

It takes a file path as input, finds duplicate lines, and removes them. But if you have a large file, do not use this. I'm using this method on a very small .txt file (a kind of log file where order is not important).

Y.Kaan Yılmaz
  • There are several issues here: 1) OP doesn't seem to want to write back to a file 2) order would not be preserved nor does your code sort 3) It doesn't make use of `DictPage` nor does the code operate on individual words 4) (more a personal viewpoint) just posting code without any explanation of what's different and why is not going to help that much. – Thomas Mar 22 '16 at 13:37

First of all, why are you using an ArrayList to store your list of words?

ArrayList<DictPage> listOfWords = new ArrayList<DictPage>(); 

You should use a Set (such as a HashSet, a TreeSet, or some other implementation of Set) to store your words if you don't want duplicates.

Set<DictPage> listOfWords = new HashSet<DictPage>(); // no duplicates, but not sorted

Or

Set<DictPage> listOfWords = new TreeSet<DictPage>(); // no duplicates, and sorted as well

This would make sure that your list of words does not contain any duplicates.

And if you want them sorted straight away, you can use a TreeSet, which makes it even easier.
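One caveat with the HashSet route: a HashSet deduplicates via equals/hashCode, so DictPage has to override both (keyed on the word); otherwise every instance is distinct and nothing is removed. A minimal sketch, with the DictPage fields assumed from the question:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class HashSetDemo {
    // Assumed shape of the question's DictPage, with equals/hashCode keyed on the word.
    static class DictPage {
        private final String word;
        private final int page;
        DictPage(String word, int page) { this.word = word; this.page = page; }
        String getWord() { return word; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof DictPage)) return false;
            return word.equals(((DictPage) o).word);
        }

        @Override
        public int hashCode() { return Objects.hash(word); }
    }

    public static void main(String[] args) {
        Set<DictPage> words = new HashSet<>();
        words.add(new DictPage("ask", 0));
        words.add(new DictPage("ask", 5)); // same word: now treated as a duplicate
        System.out.println(words.size()); // 1
    }
}
```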

Abubakkar