I am implementing a very rudimentary inverted index, and I am having trouble implementing a phrase searching method.
I have the following structure:
InvertedIndex.java: here I have a data structure:
private Map<String, ArrayList<Postings>> index = new HashMap<String, ArrayList<Postings>>();
where I store a word and a Posting list with all of the docId's and associated term positions in a document.
My Postings.java class has the following structure:
private Map<String, ArrayList<Integer>> postings;
I have getters and setters for all of these data structures, so I am not including them because it would be too much for this post. The string is the docId, and the Arraylist holds all positions in a document for a word.
I have an class where I am implementing the following method to search by phrase:
public ArrayList<String> searchByPhrase(String...terms){
if (terms == null || terms.length < 2){
return null;
}
ArrayList<String> documents = new ArrayList<String>();
for (int i = 0; i < terms.length; i++){
ArrayList<Postings> postings1 = index.getPostings(terms[i]);
if ((i + 1) < terms.length){
ArrayList<Postings> postings2 = index.getPostings(terms[i+1]);
int smaller = 0;
if (postings2.size() < postings1.size()){
smaller = postings2.size();
}
else {
smaller = postings1.size();
}
for (int j = 0; j < smaller; j++){
Postings p1 = postings1.get(j);
Postings p2 = postings2.get(j);
if (p1.containsID(p2.getDocId())){
System.out.println("FOUND MATCHING DOC");
//Do position checking in here
}
}
}
}
return documents;
}
I know that within this method I have to check so that positions are within one spot away from each other. I have not yet implemented that because I want to first be able to find documents that are the same (which this is currently not doing). When I run this I get nothing back, and I have various documents that I know share words.
I want this method to be able to search for terms of various sizes ("hello world", "Thank you so much for your help", etc....). I feel like I am overcomplicating this, but I am very lost as to how to tackle it. Any suggestions would be appreciated.