
I want to write a custom analyzer in PyLucene. Usually in Java Lucene, when you write an analyzer class, it inherits from Lucene's Analyzer class.

But PyLucene uses JCC, the Java-to-C++/Python compiler.

So how do you let a Python class inherit from a Java class using JCC, and in particular, how do you write a custom PyLucene analyzer?

Thanks.

Alex Brooks

2 Answers


Here's an example of an Analyzer that wraps EdgeNGramTokenFilter:

import lucene
class EdgeNGramAnalyzer(lucene.PythonAnalyzer):
    '''
    This is an example of a custom Analyzer (in this case an edge-n-gram analyzer)
    EdgeNGram Analyzers are good for type-ahead
    '''

    def __init__(self, side, minlength, maxlength):
        '''
        Args:
            side[enum] Can be one of lucene.EdgeNGramTokenFilter.Side.FRONT or lucene.EdgeNGramTokenFilter.Side.BACK
            minlength[int]
            maxlength[int]
        '''
        lucene.PythonAnalyzer.__init__(self)
        self.side = side
        self.minlength = minlength
        self.maxlength = maxlength

    def tokenStream(self, fieldName, reader):
        result = lucene.LowerCaseTokenizer(lucene.Version.LUCENE_CURRENT, reader)
        result = lucene.StandardFilter(result)
        result = lucene.StopFilter(True, result, lucene.StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        result = lucene.ASCIIFoldingFilter(result)
        result = lucene.EdgeNGramTokenFilter(result, self.side, self.minlength, self.maxlength)
        return result

Here's another example that re-implements a Porter-stemming analyzer:

# This sample illustrates how to write an Analyzer 'extension' in Python.
# 
#   What is happening behind the scenes ?
#
# The PorterStemmerAnalyzer python class does not in fact extend Analyzer,
# it merely provides an implementation for Analyzer's abstract tokenStream()
# method. When an instance of PorterStemmerAnalyzer is passed to PyLucene,
# with a call to IndexWriter(store, PorterStemmerAnalyzer(), True) for
# example, the PyLucene SWIG-based glue code wraps it into an instance of
# PythonAnalyzer, a proper java extension of Analyzer which implements a
# native tokenStream() method whose job is to call the tokenStream() method
# on the python instance it wraps. The PythonAnalyzer instance is the
# Analyzer extension bridge to PorterStemmerAnalyzer.

'''
More explanation...
Analyzers split a chunk of text up into tokens.
Analyzers are applied to an index globally (unless you use a per-field analyzer).
Analyzers combine Tokenizers and TokenFilters:
Tokenizers break a string up into tokens; TokenFilters break Tokens into more
Tokens or filter Tokens out.
'''

import sys, os
from datetime import datetime
from lucene import *
from IndexFiles import IndexFiles


class PorterStemmerAnalyzer(PythonAnalyzer):

    def tokenStream(self, fieldName, reader):

        #There can only be 1 tokenizer in each Analyzer
        result = StandardTokenizer(Version.LUCENE_CURRENT, reader)
        result = StandardFilter(result)
        result = LowerCaseFilter(result)
        result = PorterStemFilter(result)
        result = StopFilter(True, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        return result


if __name__ == '__main__':
    if len(sys.argv) < 2:
        sys.exit("requires at least one argument: lucene-index-path")
    initVM()
    start = datetime.now()
    try:
        IndexFiles(sys.argv[1], "index", PorterStemmerAnalyzer())
        end = datetime.now()
        print end - start
    except Exception, e:
        print "Failed: ", e
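The tokenizer/filter chaining that `tokenStream()` performs can be sketched in plain Python, with no lucene required. The function names below are hypothetical illustrations of the concept, not Lucene APIs: each stage wraps the previous stage's token stream, exactly like the nested `result = ...(result)` calls above.

```python
def whitespace_tokenizer(text):
    # Tokenizer: break the raw string into tokens (there is only
    # one tokenizer at the head of the chain)
    for token in text.split():
        yield token

def lowercase_filter(tokens):
    # TokenFilter: transform each token
    for token in tokens:
        yield token.lower()

def stop_filter(tokens, stop_words):
    # TokenFilter: drop unwanted tokens entirely
    for token in tokens:
        if token not in stop_words:
            yield token

# Chain the stages the same way tokenStream() does
result = whitespace_tokenizer("The Quick Brown Fox")
result = lowercase_filter(result)
result = stop_filter(result, {"the", "a", "an"})
print(list(result))  # ['quick', 'brown', 'fox']
```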

Also check out PerFieldAnalyzerWrapper.java and KeywordAnalyzerTest.py:

    analyzer = PerFieldAnalyzerWrapper(SimpleAnalyzer())
    analyzer.addAnalyzer("partnum", KeywordAnalyzer())

    query = QueryParser(Version.LUCENE_CURRENT, "description",
                        analyzer).parse("partnum:Q36 AND SPACE")
    scoreDocs = self.searcher.search(query, 50).scoreDocs
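The per-field dispatch that PerFieldAnalyzerWrapper performs can also be sketched in plain Python. All the class and function names below are hypothetical stand-ins for illustration, not the lucene API: a default analyzer handles every field unless an override has been registered for that field name.

```python
def simple_analyzer(text):
    # default: lowercase and split on whitespace
    return [t.lower() for t in text.split()]

def keyword_analyzer(text):
    # keyword-style: treat the entire field value as a single token
    return [text]

class PerFieldDispatch:
    """Toy stand-in for PerFieldAnalyzerWrapper's field -> analyzer lookup."""

    def __init__(self, default):
        self.default = default
        self.overrides = {}

    def add_analyzer(self, field, analyzer):
        self.overrides[field] = analyzer

    def tokens(self, field, text):
        # use the per-field override if one exists, else the default
        return self.overrides.get(field, self.default)(text)

dispatch = PerFieldDispatch(simple_analyzer)
dispatch.add_analyzer("partnum", keyword_analyzer)
print(dispatch.tokens("description", "Space Heater"))  # ['space', 'heater']
print(dispatch.tokens("partnum", "Q36"))               # ['Q36']
```

This mirrors why "partnum" above is given a KeywordAnalyzer: part numbers like Q36 must survive as single, unmodified tokens while the rest of the index is tokenized normally.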
Ben DeMott

You can inherit from any class in PyLucene, but only the ones whose names start with Python also extend the underlying Java class, i.e., make the relevant methods "virtual" when called from Java code. So for a custom analyzer, inherit from PythonAnalyzer and implement the tokenStream method.

A. Coady
  • Any "secret" to make this work?? I've tried both inheriting from `PythonAnalyzer` and `ReusableAnalyzerBase` and both generate an invalid args exception when creating the query-parser instance – Justin Dec 08 '11 at 17:31
  • @Justin posted a full example of a custom Analyzer passed into the index creation, also adding perFieldAnalyzer - hope this helps. Not sure if the example I posted is so out of date that it won't work with your version. – Ben DeMott Jan 25 '21 at 21:32
  • @BenDeMott your example is outdated, but there is a `test_PerFieldAnalyzerWrapper.py` file that comes with Lucene 8.6.1. Thanks for the direction. – PSK Jan 27 '21 at 23:30