Lucene get list of matched keywords

Question

I have a Java (Lucene 4) based application and a set of keywords fed into it as a search query. A keyword may consist of more than one word, e.g. “memory”, “old house”, “European Union law”.

I need a way to get the list of matched keywords out of an indexed document, and ideally also the position of each keyword in the document (including multi-word keywords). I tried the Lucene highlighter package, but I need only the keywords, without any surrounding text. The highlighter also returns the words of a multi-word keyword as separate fragments.

I would greatly appreciate any help.

score 0 · Answer 1 · edited May 23 '17 at 12:14


There's a similar (possibly same) question here: Get matched terms from Lucene query

Did you see this?

The solution suggested there is to break a complex query down into simpler sub-queries until you reach a TermQuery, and then check each one with searcher.explain(query, docId): if the explanation reports a match, you know that term matched the document.
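A rough sketch of that idea, assuming Lucene 4.x on the classpath. The class and method names (MatchedClauses, matchedLeaves) are made up for illustration and are not part of Lucene. The sketch stops at any non-boolean query, so a PhraseQuery for a multi-word keyword is reported as one unit rather than word by word, and it skips prohibited (MUST_NOT) clauses.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Hypothetical helper, not a Lucene class.
class MatchedClauses {

    /**
     * Returns the leaf queries (TermQuery, PhraseQuery, ...) inside `query`
     * that match the document `docId`, checking each leaf with explain().
     */
    static List<Query> matchedLeaves(IndexSearcher searcher, Query query, int docId)
            throws IOException {
        List<Query> matched = new ArrayList<Query>();
        collect(searcher, query, docId, matched);
        return matched;
    }

    private static void collect(IndexSearcher searcher, Query query, int docId,
                                List<Query> out) throws IOException {
        if (query instanceof BooleanQuery) {
            // Descend into the sub-queries of a boolean query.
            for (BooleanClause clause : ((BooleanQuery) query).clauses()) {
                if (!clause.isProhibited()) {
                    collect(searcher, clause.getQuery(), docId, out);
                }
            }
        } else if (searcher.explain(query, docId).isMatch()) {
            // Leaf query whose explanation reports a match for this document.
            out.add(query);
        }
    }
}
```

Here docId would be the doc field of a ScoreDoc returned by the search. Each leaf costs one explain() call per document.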

I think it's not very efficient, but it worked for me until I ran into SpanQueries. It might be enough for you.
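If one explain() call per term turns out to be too slow, or you need positions for multi-word keywords, a different route is to re-scan the text of each hit against the keyword list yourself. This is not from the linked answer and does not use any Lucene API. A minimal plain-Java sketch, assuming you can get the document text back (e.g. from a stored field) and that simple lowercase, split-on-non-alphanumerics tokenization is close enough to your analyzer. KeywordMatcher and Match are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: finds single- and multi-word keywords in a text
// and reports their token positions. Not a Lucene class.
class KeywordMatcher {

    /** A matched keyword with its first and last token position (0-based, inclusive). */
    static final class Match {
        final String keyword;
        final int start;
        final int end;

        Match(String keyword, int start, int end) {
            this.keyword = keyword;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return keyword + "@" + start + "-" + end;
        }
    }

    private final Set<String> keywords = new HashSet<String>();
    private int maxWords = 1; // length in words of the longest keyword

    KeywordMatcher(Collection<String> rawKeywords) {
        for (String raw : rawKeywords) {
            List<String> words = tokenize(raw);
            if (words.isEmpty()) {
                continue;
            }
            // Store keywords in normalized form: lowercase words joined by one space.
            keywords.add(String.join(" ", words));
            maxWords = Math.max(maxWords, words.size());
        }
    }

    /** Lowercases the text and splits it on runs of non-letter, non-digit characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns every keyword occurrence in the document, ordered by start position. */
    List<Match> find(String document) {
        List<String> tokens = tokenize(document);
        List<Match> matches = new ArrayList<Match>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder candidate = new StringBuilder();
            // Grow the candidate one word at a time, up to the longest keyword length.
            for (int len = 1; len <= maxWords && start + len <= tokens.size(); len++) {
                if (len > 1) {
                    candidate.append(' ');
                }
                candidate.append(tokens.get(start + len - 1));
                if (keywords.contains(candidate.toString())) {
                    matches.add(new Match(candidate.toString(), start, start + len - 1));
                }
            }
        }
        return matches;
    }
}
```

For the keywords “memory”, “old house” and “European Union law”, calling find("In the European Union law there is an old house with memory.") returns three matches: european union law@2-4, old house@8-9 and memory@11-11. Each multi-word keyword comes back as a single match with one position range. The positions are token indexes from this simple tokenizer, so they only line up with Lucene's positions if your analyzer tokenizes the same way.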

edited May 23 '17 at 12:14

Community


answered Apr 16 '15 at 12:40

Yossi Vainshtein


Yes, I have already seen it, thank you. However, I have a very long query composed of more than 3M keywords, so this approach is not very efficient. I was wondering if there is a low-level "service" that keeps the list of matched keywords after each search is performed. – Nikolaos Papadakis Apr 17 '15 at 05:25
What I have tried so far is to use a highlighter and get the matched fragment around the keyword. Unfortunately, this seems to return each word of a multi-word keyword as a separate match, e.g. when searching for “European countries” it returns: “...in the European countries the population is...”. What I need is to have both words within the same custom tag, so that I can deduce they belong to the same keyword. – Nikolaos Papadakis Apr 17 '15 at 05:37
