
I have a Java (Lucene 4) based application and a set of keywords fed into the application as a search query. A keyword may consist of more than one word, e.g. “memory”, “old house”, “European Union law”, etc.

I need a way to get the list of matched keywords out of an indexed document and, if possible, the keyword positions in the document as well (including for the multi-word keywords). I tried the Lucene highlighter package, but I need only the keywords themselves, without any surrounding text. The highlighter also returns each word of a multi-word keyword as a separate fragment.

I would greatly appreciate any help.

1 Answer


There's a similar (possibly identical) question here: Get matched terms from Lucene query

Did you see this?

The solution suggested there is to break a complex query down into simpler sub-queries until you reach a TermQuery, and then check each one with searcher.explain(query, docId): if the explanation reports a match, you know that term matched the document.

I don't think it's very efficient, but it worked for me until I ran into SpanQueries. It might be enough for you.
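If the per-term explain check turns out to be too slow for a very long keyword list, one alternative is to walk the document instead of the query. The following is only a rough sketch under my own assumptions, and it deliberately does not use the Lucene API: KeywordMatcher, Match and the simple letter/digit tokenizer are hypothetical names invented for illustration, and the tokenizer will not agree with whatever Analyzer the index was built with. It looks up every run of up to N consecutive tokens in a hash set of keywords, so a multi-word keyword is reported as a single match with its token positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: finds single- and multi-word
// keywords in a text and reports each one as a single match.
final class KeywordMatcher {

    /** A matched keyword and its token span (start inclusive, end exclusive). */
    static final class Match {
        final String keyword;
        final int startToken;
        final int endToken;

        Match(String keyword, int startToken, int endToken) {
            this.keyword = keyword;
            this.startToken = startToken;
            this.endToken = endToken;
        }

        @Override
        public String toString() {
            return keyword + "@" + startToken + "-" + endToken;
        }
    }

    // Assumption: a token is a run of letters or digits, compared in lower case.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    private final Set<String> keywords = new HashSet<>();
    private int maxWords = 0; // length in words of the longest keyword

    KeywordMatcher(List<String> rawKeywords) {
        for (String raw : rawKeywords) {
            List<String> words = tokenize(raw);
            if (words.isEmpty()) {
                continue;
            }
            // Store the normalized form, e.g. "european union law".
            keywords.add(String.join(" ", words));
            maxWords = Math.max(maxWords, words.size());
        }
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Checks every run of 1..maxWords consecutive tokens against the keyword set. */
    List<Match> findMatches(String document) {
        List<String> tokens = tokenize(document);
        List<Match> matches = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder candidate = new StringBuilder();
            int limit = Math.min(tokens.size(), start + maxWords);
            for (int end = start; end < limit; end++) {
                if (end > start) {
                    candidate.append(' ');
                }
                candidate.append(tokens.get(end));
                String key = candidate.toString();
                if (keywords.contains(key)) {
                    matches.add(new Match(key, start, end + 1));
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        KeywordMatcher matcher = new KeywordMatcher(
                Arrays.asList("memory", "old house", "European Union law"));
        System.out.println(
                matcher.findMatches("The old house holds a memory of European Union law."));
    }
}
```

Running main prints [old house@1-3, memory@5-6, european union law@7-10], so each multi-word keyword comes back as one unit rather than as separate words. The cost depends on the document length and on the longest keyword, not on the number of keywords, but the positions are token offsets from this toy tokenizer, so they would still have to be mapped to whatever positions your index uses.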

Yossi Vainshtein
  • Yes, I have already seen it, thank you. However, I have a very long query composed of more than 3M keywords, so this approach is not very efficient. I was wondering whether there is a low-level "service" that keeps the list of matched keywords after each search is performed. – Nikolaos Papadakis Apr 17 '15 at 05:25
  • What I have tried so far is to use a highlighter and get the matched fragment around the keyword. Unfortunately, this seems to treat each word of a multi-word keyword as a separate match; e.g., when searching for “European countries” it returns: “...in the European countries the population is...”. What I need is to have both words within the same custom tag, so that I can deduce that they belong to the same keyword. – Nikolaos Papadakis Apr 17 '15 at 05:37