I am new to pylucene and I am trying to build a custom analyzer which tokenizes text on the basis of underscores only, i.e. it should retain the whitespaces. Example: "Hi_this is_awesome" should be tokenized into ["hi", "this is", "awesome"] tokens.
From various code examples I understood that I need to override the incrementToken method for a CustomTokenizer and write a CustomAnalyzer for which the TokenStream needs to use the CustomTokenizer followed by a LowerCaseFilter to achieve the same.
I am facing problems in implementing the incrementToken method and connecting the dots (how the tokenizer maybe used as usually the Analyzers depend on TokenFilter which depend on TokenStreams) as there is very little documentation available on pylucene.