3

I am implementing a very rudimentary inverted index, and I am having trouble implementing a phrase searching method.

I have the following structure:

InvertedIndex.java: here I have a data structure:

private Map<String, ArrayList<Postings>> index = new HashMap<String, ArrayList<Postings>>(); 

where I store a word and a Posting list with all of the docId's and associated term positions in a document.

My Postings.java class has the following structure:

private Map<String, ArrayList<Integer>> postings; 

I have getters and setters for all of these data structures, so I am not including them because it would be too much for this post. The string is the docId, and the Arraylist holds all positions in a document for a word.

I have an class where I am implementing the following method to search by phrase:

    public ArrayList<String> searchByPhrase(String...terms){
        if (terms == null || terms.length < 2){
            return null; 
        }

        ArrayList<String> documents = new ArrayList<String>(); 

        for (int i = 0; i < terms.length; i++){
            ArrayList<Postings> postings1 = index.getPostings(terms[i]);
            if ((i + 1) < terms.length){
                ArrayList<Postings> postings2 = index.getPostings(terms[i+1]);

                int smaller = 0; 
                if (postings2.size() < postings1.size()){
                    smaller = postings2.size(); 
                }
                else {
                    smaller = postings1.size(); 
                }

                for (int j = 0; j < smaller; j++){
                        Postings p1 = postings1.get(j); 
                        Postings p2 = postings2.get(j); 
                        if (p1.containsID(p2.getDocId())){
                            System.out.println("FOUND MATCHING DOC");
                            //Do position checking in here
                        }   
                }
            }

        }


        return documents; 
    }

I know that within this method I have to check so that positions are within one spot away from each other. I have not yet implemented that because I want to first be able to find documents that are the same (which this is currently not doing). When I run this I get nothing back, and I have various documents that I know share words.

I want this method to be able to search for terms of various sizes ("hello world", "Thank you so much for your help", etc....). I feel like I am overcomplicating this, but I am very lost as to how to tackle it. Any suggestions would be appreciated.

Anderology
  • 147
  • 1
  • 2
  • 9
  • p1,containsID must be checking only one posting. You need to check the whole list. Really, to make it efficient, you need to build a set of all the document IDs in one list and see if any of the IDs in the other list are in that set. But you will get that for free when you start merging the lists to find sequential word positions, so don't bother doing it as a separate step. – Matt Timmermans Mar 25 '16 at 03:20

0 Answers0