
I am new to PyLucene and I am trying to build a custom analyzer that tokenizes text on underscores only, i.e. it should retain whitespace. For example, "Hi_this is_awesome" should be tokenized into the tokens ["hi", "this is", "awesome"].

From various code examples I understood that I need to override the incrementToken method of a custom Tokenizer, and write a custom Analyzer whose TokenStream uses that tokenizer followed by a LowerCaseFilter.

I am having trouble implementing the incrementToken method and connecting the dots (how the tokenizer is actually used, since Analyzers usually depend on TokenFilters, which in turn depend on TokenStreams), as there is very little documentation available for PyLucene.
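For reference, the desired behavior can be stated in plain Python (this only describes the target output; it is not a Lucene analyzer):

```python
text = "Hi_this is_awesome"
# split on underscores, drop empty pieces, lowercase each token
tokens = [t.lower() for t in text.split("_") if t]
print(tokens)  # ['hi', 'this is', 'awesome']
```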

Vandana

1 Answer


Got it working eventually by creating a new tokenizer that treats every character other than an underscore as part of the token being generated (so the underscore becomes the separator):

# PyLucene 3.x-style flat imports; later versions expose these classes
# under the org.apache.* package paths instead
from lucene import PythonAnalyzer, PythonCharTokenizer, LowerCaseFilter

class UnderscoreSeparatorTokenizer(PythonCharTokenizer):
  def __init__(self, input):
    PythonCharTokenizer.__init__(self, input)

  def isTokenChar(self, c):
    # every character except the underscore belongs to a token,
    # so runs of text between underscores become the tokens
    return c != "_"

class UnderscoreSeparatorAnalyzer(PythonAnalyzer):
  def __init__(self, version):
    PythonAnalyzer.__init__(self, version)

  def tokenStream(self, fieldName, reader):
    # chain: tokenizer splits on underscores, filter lowercases the tokens
    tokenizer = UnderscoreSeparatorTokenizer(reader)
    return LowerCaseFilter(tokenizer)
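For anyone without a PyLucene install handy, the tokenizer's logic can be sketched in pure Python: a CharTokenizer emits maximal runs of characters for which isTokenChar returns true, and the LowerCaseFilter then lowercases each emitted token. This sketch only mirrors that behavior, it is not the PyLucene API:

```python
def tokenize(text, is_token_char=lambda c: c != "_"):
    """Emit maximal runs of token chars, lowercased (CharTokenizer-like)."""
    tokens, buf = [], []
    for ch in text:
        if is_token_char(ch):
            buf.append(ch)
        elif buf:
            tokens.append("".join(buf).lower())
            buf = []
    if buf:
        tokens.append("".join(buf).lower())
    return tokens

print(tokenize("Hi_this is_awesome"))  # ['hi', 'this is', 'awesome']
```

Note that, like CharTokenizer, this drops empty tokens produced by consecutive underscores.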
Vandana