
I'm trying to produce something similar to what Lucene's QueryParser does, but without the parser, i.e. run a string through StandardAnalyzer, tokenize the result, and combine TermQuerys in a BooleanQuery. My problem is that StandardAnalyzer gives me Tokens, not Terms. I can convert a Token to a Term by extracting its string with Token.term(), but that's 2.4.x-only, and it seems backwards because I have to specify the field a second time. What is the proper way of producing a TermQuery with StandardAnalyzer?

I'm using pylucene, but I guess the answer is the same for Java etc. Here is the code I've come up with:

from lucene import *

def term_match(self, phrase):
    query = BooleanQuery()
    sa = StandardAnalyzer()
    for token in sa.tokenStream("contents", StringReader(phrase)):
        # Token.term() is 2.4.x-only, and the field name is repeated here
        term_query = TermQuery(Term("contents", token.term()))
        query.add(term_query, BooleanClause.Occur.SHOULD)
    return query
Joakim Lundborg

2 Answers


The established way to get the token text is token.termText() - that API's been there forever.

And yes, you'll need to specify a field name to both the Analyzer and the Term; I think that's considered normal. 8-)

RichieHindle
  • According to the API docs, token.termText() is deprecated, and they point me to instead using something like token.termBuffer()[0:token.termLength()] which works, but seems even more awkward. – Joakim Lundborg Sep 08 '09 at 08:38
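The termBuffer()/termLength() pair can look odd if you haven't seen Lucene's internals: the analyzer reuses an oversized character buffer between tokens, and the term text is just the first termLength() characters of it. A toy stand-in (pure Python, no Lucene required; FakeToken is a hypothetical class that only mirrors the names of the 2.4 Token API) shows what the slice in the comment above is doing:

```python
class FakeToken:
    """Hypothetical stand-in for a Lucene 2.4 Token; method names mirror its API."""

    def __init__(self, text, capacity=16):
        # Lucene reuses one oversized buffer between tokens, so the buffer
        # may be longer than the current term.
        self._buffer = list(text) + ['\x00'] * (capacity - len(text))
        self._length = len(text)

    def termBuffer(self):
        return self._buffer

    def termLength(self):
        return self._length


def term_text(token):
    # The pattern from the comment: token.termBuffer()[0:token.termLength()]
    return ''.join(token.termBuffer()[:token.termLength()])


term_text(FakeToken("hello"))  # → "hello"
```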

I've come across the same problem, and, using Lucene 2.9 API and Java, my code snippet looks like this:

final TokenStream tokenStream = new StandardAnalyzer( Version.LUCENE_29 )
    .tokenStream( fieldName , new StringReader( value ) );
final List< String > result = new ArrayList< String >();
try {
    while ( tokenStream.incrementToken() ) {
        // getAttribute() is typed in 2.9, so no cast is needed
        final TermAttribute term = tokenStream.getAttribute( TermAttribute.class );
        result.add( term.term() );
    }
} catch ( IOException e ) {
    // incrementToken() declares IOException
    throw new RuntimeException( e );
}
Daniel Hiller
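Whichever Lucene version and token API you end up on, the shape both answers arrive at is the same: analyze the phrase into term strings, then add each one as a SHOULD clause on the field. A minimal pure-Python sketch of that pattern (str.split standing in for StandardAnalyzer's token stream, and plain tuples standing in for TermQuery/BooleanClause, so this runs without Lucene):

```python
def build_should_clauses(tokenize, field, phrase):
    """Mimic the question's term_match(): one SHOULD clause per token.

    `tokenize` stands in for StandardAnalyzer's token stream; in real
    PyLucene each clause would be a TermQuery added to a BooleanQuery
    with BooleanClause.Occur.SHOULD.
    """
    return [("SHOULD", (field, term)) for term in tokenize(phrase)]


build_should_clauses(str.split, "contents", "hello world")
# → [('SHOULD', ('contents', 'hello')), ('SHOULD', ('contents', 'world'))]
```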