
I am currently in the process of upgrading a search engine application from Lucene 3.5.0 to version 4.10.3. There have been some substantial API changes in version 4 that break backward compatibility. I have managed to fix most of them, but a few issues remain that I could use some help with:

  1. "cannot override final method from Analyzer"

The original code extended the Analyzer class and overrode tokenStream(...).

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    CharStream charStream = CharReader.get(reader);        
    return
        new LowerCaseFilter(version,
            new SeparationFilter(version,
                new WhitespaceTokenizer(version,
                    new HTMLStripFilter(charStream))));
}

But this method is final now and I am not sure how to understand the following note from the change log:

ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer implementations must now use Analyzer.TokenStreamComponents, rather than overriding .tokenStream() and .reusableTokenStream() (which are now final).

There is another problem in the method quoted above:

  2. "The method get(Reader) is undefined for the type CharReader"

There seem to have been some considerable changes here, too.

  3. "TermPositionVector cannot be resolved to a type"

This class is gone now in Lucene 4. Are there any simple fixes for this? From the change log:

The term vectors APIs (TermFreqVector, TermPositionVector, TermVectorMapper) have been removed in favor of the above flexible indexing APIs, presenting a single-document inverted index of the document from the term vectors.

Probably related to this:

  4. "The method getTermFreqVector(int, String) is undefined for the type IndexReader."

Both problems occur here, for instance:

TermPositionVector termVector = (TermPositionVector) reader.getTermFreqVector(...);

("reader" is of type IndexReader)

I would appreciate any help with these issues.


1 Answer


I found core developer Uwe Schindler's response to your question on the Lucene mailing list. It took me some time to wrap my head around the new API, so I need to write down something before I forget.

These notes apply to Lucene 4.10.3.

Implementing an Analyzer (1-2)

new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(new HTMLStripCharFilter(reader));
        TokenStream sink = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, sink);
    }
};
  1. The constructor of TokenStreamComponents takes a source and a sink. The sink is the end result of your token stream, returned by Analyzer.tokenStream(), so set it to your filter chain. The source is the token stream before you apply any filters.
  2. HTMLStripCharFilter, despite its name, is actually a subclass of java.io.Reader which removes HTML constructs, so you no longer need CharReader.
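To sanity-check the new analyzer, you can consume its output directly. Here is a sketch (the field name and input text are made up) that follows the TokenStream contract Lucene 4 now enforces: reset() before the first incrementToken(), then end() and close():

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Analyzer analyzer = ...; // the anonymous Analyzer defined above
try (TokenStream ts = analyzer.tokenStream("text", new StringReader("<b>Hello</b> World"))) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset(); // mandatory in Lucene 4 before calling incrementToken()
    while (ts.incrementToken()) {
        System.out.println(term.toString()); // prints the lower-cased, HTML-stripped tokens
    }
    ts.end(); // consume end-of-stream state (e.g. final offset)
}
```

TokenStream implements Closeable, so the try-with-resources takes care of close() for you.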

Term vector replacements (3-4)

Term vectors work differently in Lucene 4, so there are no straightforward method swaps. The specific answer depends on what your requirements are.

If you want positional information, you have to index your fields with positional information in the first place:

Document doc = new Document();
FieldType f = new FieldType();
f.setIndexed(true);
f.setStoreTermVectors(true);
f.setStoreTermVectorPositions(true);
doc.add(new Field("text", "hello", f));
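For completeness, here is a sketch of getting that document into an index you can read back from; RAMDirectory and StandardAnalyzer are just stand-ins here, any Directory and Analyzer will do:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Write the Document built above into an in-memory index.
Directory dir = new RAMDirectory();
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_4_10_3,
        new StandardAnalyzer(Version.LUCENE_4_10_3));
try (IndexWriter writer = new IndexWriter(dir, cfg)) {
    writer.addDocument(doc); // the Document with term vectors enabled
}
IndexReader ir = DirectoryReader.open(dir); // "ir" as used below
```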

Finally, in order to get at the frequency and positional info of a field of a document, you drill down the new API like this (adapted from this answer):

// IndexReader ir;
// int docID = 0;
Terms terms = ir.getTermVector(docID, "text");
terms.hasPositions(); // should be true if you set the field to store positions
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
// Explore the terms for this field
while ((term = termsEnum.next()) != null) {
    // Enumerate through documents, in this case only one
    DocsAndPositionsEnum docsEnum = termsEnum.docsAndPositions(null, null);
    int docIdEnum;
    while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < docsEnum.freq(); i++) {
            System.out.println(term.utf8ToString() + " " + docIdEnum + " "
                    + docsEnum.nextPosition());
        }
    }
}

It'd be nice if Terms.iterator() returned an actual Iterable.
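As a workaround, you can wrap any "next() returns null at the end" cursor (which is exactly how TermsEnum.next() behaves) in a small adapter so it works in a for-each loop. This is a generic sketch in plain Java, not part of the Lucene API; note it is single-use, since iterating advances the underlying cursor:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Adapts a TermsEnum-style cursor (next() returns null at the end)
// into a java.lang.Iterable. One-shot: iterator() may only be called once,
// because fetching elements consumes the underlying cursor.
class NullTerminatedIterable<T> implements Iterable<T> {
    private final Supplier<T> cursor;

    NullTerminatedIterable(Supplier<T> cursor) {
        this.cursor = cursor;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private T lookahead = cursor.get(); // pre-fetch the first element

            @Override
            public boolean hasNext() {
                return lookahead != null;
            }

            @Override
            public T next() {
                if (lookahead == null) {
                    throw new NoSuchElementException();
                }
                T current = lookahead;
                lookahead = cursor.get();
                return current;
            }
        };
    }
}
```

With a TermsEnum you would pass something like `() -> { try { BytesRef t = termsEnum.next(); return t; } catch (IOException e) { throw new RuntimeException(e); } }` as the cursor.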
