I want to read the index from my Indexer file.
The result I want is every term of each document, together with its TF-IDF value.
Please suggest some example code. Thanks :)
The first thing is to get a listing of documents. An alternative might be iterating through the indexed terms, but the method IndexReader.terms() appears to have been removed from 4.0 (though it exists in AtomicReader, which could be worth looking at). The best method I'm aware of to get all documents is to simply loop through the document ids:
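//where reader is your IndexReader, however you go about opening/managing it
//IndexReader.isDeleted() is gone in 4.0; use the live-docs bitset instead
//(org.apache.lucene.index.MultiFields, org.apache.lucene.util.Bits)
Bits liveDocs = MultiFields.getLiveDocs(reader); //null when the index has no deletions
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    //operate on the document with id = i ...
}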
Then you need a listing of all indexed terms. I'm assuming we have no interest in stored fields, since the data you want doesn't make sense for them. For retrieving the terms you can use IndexReader.getTermVectors(int), which only returns data for fields that were indexed with term vectors. Note that I'm not actually retrieving the document, since we don't need to access it directly. Continuing from where we left off:
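//needs org.apache.lucene.index.*, org.apache.lucene.util.Bits and
//org.apache.lucene.search.similarities.DefaultSimilarity
String field;
FieldsEnum fieldsiterator;
TermsEnum termsiterator;
//To simplify, you can rely on DefaultSimilarity to calculate tf and idf for you.
DefaultSimilarity freqcalculator = new DefaultSimilarity();
//numDocs and maxDoc are not the same thing: maxDoc also counts deleted documents
int numDocs = reader.numDocs();
int maxDoc = reader.maxDoc();
Bits liveDocs = MultiFields.getLiveDocs(reader); //null when the index has no deletions
for (int i = 0; i < maxDoc; i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    Fields vectors = reader.getTermVectors(i);
    if (vectors == null)
        continue; //no term vectors were stored for this document
    fieldsiterator = vectors.iterator();
    while ((field = fieldsiterator.next()) != null) {
        termsiterator = fieldsiterator.terms().iterator(null);
        while (termsiterator.next() != null) {
            //i = document id, field = field name
            //String representation of the current term
            String termtext = termsiterator.term().utf8ToString();
            //Assumption (untested): on a term-vector enum, totalTermFreq() is the
            //frequency of the term within this one document.
            float tf = freqcalculator.tf(termsiterator.totalTermFreq());
            //Get the document frequency from the enum.
            //I haven't tested this, and I'm not quite 100% sure of the context of this method.
            //If it doesn't work, idfalternate below should.
            int docfreq = termsiterator.docFreq();
            float idfalternate = freqcalculator.idf(reader.docFreq(field, termsiterator.term()), numDocs);
        }
    }
}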
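If it helps to see what those DefaultSimilarity calls boil down to, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the classic formulas tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), and it runs on a tiny in-memory corpus. The class name TfIdfSketch, its methods, and the sample tokens are made up for illustration; they are not part of Lucene. Real Lucene scoring also applies norms and boosts, which are left out here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not a Lucene API.
class TfIdfSketch {

    // Square root of the raw within-document frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // 1 + ln(numDocs / (docFreq + 1)); rarer terms get a larger value.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Counts how often each term occurs in one tokenized document.
    static Map<String, Integer> termFreqs(List<String> tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<String, Integer>();
        for (String token : tokens) {
            Integer old = freqs.get(token);
            freqs.put(token, old == null ? 1 : old + 1);
        }
        return freqs;
    }

    // Returns term -> tf * idf for the document at docIndex.
    static Map<String, Double> tfIdf(List<List<String>> corpus, int docIndex) {
        Map<String, Double> weights = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Integer> e : termFreqs(corpus.get(docIndex)).entrySet()) {
            // Document frequency: number of documents containing the term.
            int docFreq = 0;
            for (List<String> doc : corpus) {
                if (doc.contains(e.getKey())) {
                    docFreq++;
                }
            }
            weights.put(e.getKey(), tf(e.getValue()) * idf(docFreq, corpus.size()));
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for three indexed documents.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("index", "reader"),
                Arrays.asList("term", "vector"));
        for (int i = 0; i < corpus.size(); i++) {
            System.out.println("doc " + i + ": " + tfIdf(corpus, i));
        }
    }
}
```

In the loop over term vectors, the same product would be tf times idfalternate for each term, printed or collected per document id.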