This question says to look at this question... but unfortunately these clever people's solutions no longer seem to work with Lucene 6, because the signature of `createComponents` is now

    TokenStreamComponents createComponents(final String fieldName)

i.e. the `Reader` is no longer supplied. Does anyone know what the present technique should be? Are we meant to make the `Reader` a field of the `Analyzer` class?
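For concreteness, here is my understanding of the new shape of the override in Lucene 6 (a sketch based on the 6.x javadoc; `MyAnalyzer` is a made-up name, and I haven't checked this against every 6.x release):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical analyzer: createComponents no longer receives a Reader,
// only the field name. The Tokenizer is built "empty" here.
public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        return new TokenStreamComponents(source);
    }
}
```

So the components are constructed without any input attached, which is what prompts the question of where the `Reader` is supposed to come from.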
NB I don't actually want to filter anything: I want to get hold of the streams of tokens in order to create my own data structure (for frequency analysis and sequence matching). So the idea is to use Lucene's `Analyzer` technology to produce different models of the corpus. A trivial example might be: one model where everything is lower-cased, another where casing is left as in the corpus.
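To illustrate the two-model idea, this is the kind of thing I'm after (a sketch only: `CorpusModels`, `lowerCased`, `casePreserving`, and `frequencies` are names I've invented, and I'm assuming the 6.x package `org.apache.lucene.analysis.core.LowerCaseFilter`, which I believe moved to `org.apache.lucene.analysis` in 7.0):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CorpusModels {
    // Model 1: everything lower-cased.
    static final Analyzer lowerCased = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            return new TokenStreamComponents(source, new LowerCaseFilter(source));
        }
    };

    // Model 2: casing left as in the corpus.
    static final Analyzer casePreserving = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            return new TokenStreamComponents(new StandardTokenizer());
        }
    };

    // Consume the token stream to build a frequency table. Note that we never
    // supply a Reader ourselves: tokenStream(field, text) wires the text to
    // the Tokenizer internally.
    static Map<String, Integer> frequencies(Analyzer analyzer, String text) throws IOException {
        Map<String, Integer> freq = new HashMap<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                freq.merge(term.toString(), 1, Integer::sum);
            }
            ts.end();
        }
        return freq;
    }
}
```

If this is right, `frequencies(lowerCased, "The the THE")` should collapse all three tokens into one entry, while `casePreserving` keeps them distinct.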
PS I also saw this question: but once again we have to supply a `Reader`, i.e. I'm assuming the context there was tokenising for the purpose of querying. When writing an index, the `Analyzer`s in earlier versions were clearly getting a `Reader` from somewhere by the time `createComponents` was called, but at that point you don't yet have a `Reader` yourself (that I know of...).
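For what it's worth, the two-argument `tokenStream(String, Reader)` overload does still exist in 6.x; my guess (unverified) is that Lucene attaches the `Reader` to the `Tokenizer` itself, via `TokenStreamComponents.setReader`, after `createComponents(fieldName)` has built the chain, so the components never need to see a `Reader` at construction time. E.g. (`ReaderOverloadDemo` and `tokens` are names I've made up for illustration):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ReaderOverloadDemo {
    // Collect the terms produced for a Reader. We hand the Reader to
    // tokenStream(); we never pass it to createComponents ourselves.
    static List<String> tokens(Analyzer analyzer, Reader reader) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("body", reader)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // StandardAnalyzer lower-cases and drops English stopwords by default.
        System.out.println(tokens(new StandardAnalyzer(), new StringReader("Hello token streams")));
    }
}
```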