Term Vector Frequency in Lucene 4.0

Question

I'm upgrading from Lucene 3.6 to Lucene 4.0-beta. In Lucene 3.x, the IndexReader contains a method IndexReader.getTermFreqVectors(), which I can use to extract the frequency of each term in a given document and field.

This method is now replaced by IndexReader.getTermVectors(), which returns Terms. How can I use this (or possibly other methods) to extract the term frequency for a given document and field?

Related to http://stackoverflow.com/questions/13537126/term-frequency-in-lucene-4-0?rq=1 and http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene — Mark Butler, Jan 22 '13 at 00:23

score 14 · Accepted Answer · edited Dec 11 '15 at 15:30

Perhaps this will help you:

// get the term vector for one document and one field
Terms terms = reader.getTermVector(docID, "fieldName"); 

if (terms != null && terms.size() > 0) {
    // access the terms for this field
    TermsEnum termsEnum = terms.iterator(null); 
    BytesRef term = null;

    // explore the terms for this field
    while ((term = termsEnum.next()) != null) {
        // enumerate through documents, in this case only one
        DocsEnum docsEnum = termsEnum.docs(null, null); 
        int docIdEnum;
        while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            // get the term frequency in the document 
            System.out.println(term.utf8ToString()+ " " + docIdEnum + " " + docsEnum.freq()); 
        }
    }
}

It helped me, at least! Thank you for these lines. – lizzie, Jul 23 '13 at 12:44

score 3 · Answer 2 · edited May 23 '17 at 12:25

See this related question, specifically:

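// term vector for one document and one field (CONTENT is the field name)
Terms vector = reader.getTermVector(docId, CONTENT);
TermsEnum termsEnum = null;
termsEnum = vector.iterator(termsEnum);
Map<String, Integer> frequencies = new HashMap<>();
BytesRef text = null;
while ((text = termsEnum.next()) != null) {
    String term = text.utf8ToString();
    // a term vector covers a single document, so the total is the per-document frequency
    int freq = (int) termsEnum.totalTermFreq();
    frequencies.put(term, freq);
    terms.add(term); // terms is a Set of term strings declared outside this snippet
}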

answered Jan 22 '13 at 00:21 · Mark Butler

In the last step, what is the variable `terms`? – Adam_G Nov 16 '17 at 14:37
terms is an instance of Set defined using the following: private final Set<String> terms = new HashSet<>(); – Ajitesh Apr 04 '18 at 11:51

score 1 · Answer 3 · answered Aug 29 '12 at 04:46

Various documentation exists on how to use the flexible indexing APIs.

Accessing the Fields/Terms for a document's term vectors uses exactly the same API you use for accessing the postings lists, since term vectors are really just a miniature inverted index for that one document.

So it's perfectly OK to use all those examples as-is, though you can take some shortcuts since you know there is only ever one document in this "miniature inverted index". For example, if you just want the frequency of a term, you can seek to it and use the aggregate statistics such as totalTermFreq (see https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/core/org/apache/lucene/index/package-summary.html#stats), rather than opening a DocsEnum that will only enumerate over a single document.
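To illustrate why the shortcut is safe, here is a minimal plain-Java model of a one-document inverted index. It is not Lucene code: the class MiniTermVector and its methods are hypothetical names made up for this sketch, and the tokenization (lowercasing and splitting on non-word characters) is an assumption standing in for a real analyzer. It only shows that when the index holds a single document, the aggregate total for a term equals that document's frequency.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model, NOT the Lucene API: every name here is hypothetical.
final class MiniTermVector {
    // term -> postings (docId -> frequency), sorted like a term dictionary
    private final Map<String, Map<Integer, Integer>> index = new TreeMap<>();

    // Tokenize one document's text and count each term under its docId.
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            index.computeIfAbsent(token, t -> new TreeMap<>())
                 .merge(docId, 1, Integer::sum);
        }
    }

    // Long way: walk the postings of a term until the document is found.
    int freqViaPostings(String term, int docId) {
        Map<Integer, Integer> postings = index.get(term);
        if (postings == null) {
            return 0;
        }
        for (Map.Entry<Integer, Integer> e : postings.entrySet()) {
            if (e.getKey() == docId) {
                return e.getValue();
            }
        }
        return 0;
    }

    // Shortcut: sum the frequencies over all postings of the term.
    // With exactly one document indexed, this equals the per-document frequency.
    long totalTermFreq(String term) {
        Map<Integer, Integer> postings = index.get(term);
        if (postings == null) {
            return 0;
        }
        long total = 0;
        for (int f : postings.values()) {
            total += f;
        }
        return total;
    }
}
```

After indexing a single document with addDocument(0, ...), freqViaPostings(term, 0) and totalTermFreq(term) return the same number for every term. As soon as a second document is added, the two diverge, which is why the shortcut applies to term vectors but not to the main index.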

score 0 · Answer 4 · edited Dec 11 '15 at 15:31

I have this working on my Lucene 4.2 index. This is a small test program that works for me.

try {
    directory[0] = new SimpleFSDirectory(new File(test1));
    directory[1] = new SimpleFSDirectory(new File(test2));
    directory[2] = new SimpleFSDirectory(new File(test3));
    directoryReader[0] = DirectoryReader.open(directory[0]);
    directoryReader[1] = DirectoryReader.open(directory[1]);
    directoryReader[2] = DirectoryReader.open(directory[2]);

    if (!directoryReader[2].isCurrent()) {
        directoryReader[2] = DirectoryReader.openIfChanged(directoryReader[2]);
    }
    MultiReader mr = new MultiReader(directoryReader);

    TermStats[] stats=null;
    try {
        stats = HighFreqTerms.getHighFreqTerms(mr, 100, "My Term");
    } catch (Exception e1) {
        e1.printStackTrace();
        return;
    }

    for (TermStats termstat : stats) {
        System.out.println("IBI_body: " + termstat.termtext.utf8ToString() +
            ", docFrequency: " + termstat.docFreq);
    }
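} catch (IOException e) {
    // opening a directory or reader can fail with an I/O error
    e.printStackTrace();
}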
