Get field's tokens from Lucene index

Question

How can I get the tokens (whether as a list of tokens, a TokenStream, or something else) that were used for a Field within a Document from a Lucene index? That is, is it possible to retrieve from the index the tokens that were passed as tokens in the example below? (I'm not asking how to get tokens out of a TokenStream.)

doc.add(new Field("title", tokens))

In the documentation there's Field.tokenStreamValue() but when I do doc.getFieldable(field_name) that simply returns null.

I've also tried (from the third comment in lucene - Fieldable.tokenStreamValue()):

TokenSources.getTokenStream(reader, doc_id, field_name)

but I get

java.lang.IllegalArgumentException: title in doc #630does not have any term position data stored
    at org.apache.lucene.search.highlight.TokenSources.getTokenStream(TokenSources.java:256)

score 2 · Accepted Answer · answered Mar 20 '12 at 08:53

The TokenSources class is a helper class to retrieve the tokens of a document for highlighting purposes. There are two ways to retrieve the terms for a given document:

re-analyzing a stored field,
reading the document's term vector.

The method you want to use tries to read the document's term vector, but fails because you didn't enable term vectors at indexing time.

So you can either enable term vectors at indexing time and keep using this method (see the Field constructor and the documentation of Field.TermVector), or re-analyze the content of your stored fields. The first method may provide better performance, especially for large fields, whereas the second one will save space (there is no additional information to store if your field is already stored).
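To make the trade-off concrete, here is a toy model in plain Java. It deliberately uses no Lucene classes: IndexedField, analyze, index, tokensFromTermVector and tokensByReanalysis are all hypothetical names invented for this sketch, and the analyzer is just a lowercase-and-split stand-in. It only illustrates why reading the term vector fails when none was recorded at indexing time, and why re-analysis needs the field to be stored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: none of these types or methods are part of Lucene.
class TokenRetrievalSketch {

    // Hypothetical stand-in for one field of an indexed document.
    static final class IndexedField {
        final String storedText;       // null when the field was not stored
        final List<String> termVector; // null when term vectors were not enabled

        IndexedField(String storedText, List<String> termVector) {
            this.storedText = storedText;
            this.termVector = termVector;
        }
    }

    // Hypothetical analyzer: lowercase, then split on non-alphanumeric characters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "Indexing": the tokens are kept only if a term vector was requested.
    static IndexedField index(String text, boolean store, boolean withTermVector) {
        return new IndexedField(store ? text : null,
                                withTermVector ? analyze(text) : null);
    }

    // Option 1: read the tokens back from the term vector (extra space, no re-analysis).
    static List<String> tokensFromTermVector(IndexedField field) {
        if (field.termVector == null) {
            throw new IllegalArgumentException("no term vector data stored");
        }
        return field.termVector;
    }

    // Option 2: re-analyze the stored text (no extra space, analysis cost on every read).
    static List<String> tokensByReanalysis(IndexedField field) {
        if (field.storedText == null) {
            throw new IllegalStateException("field was not stored");
        }
        return analyze(field.storedText);
    }

    public static void main(String[] args) {
        IndexedField withVector = index("Lucene Term Vectors", false, true);
        IndexedField storedOnly = index("Lucene Term Vectors", true, false);

        System.out.println(tokensFromTermVector(withVector));
        System.out.println(tokensByReanalysis(storedOnly));

        try {
            tokensFromTermVector(storedOnly); // mirrors the failure in the question
        } catch (IllegalArgumentException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

In the real API the corresponding switch is the Field.TermVector argument at indexing time. Note also that a field built directly from a TokenStream, as in the question, has no stored text to re-analyze, so in that situation only the term vector route applies unless the original text is kept somewhere else.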
