Given a finite dictionary of entity terms, I'm looking for a way to do entity extraction with intelligent tagging using Lucene. So far I have been able to use Lucene for:
- Searching for complex phrases with some fuzziness
- Highlighting results

However, I don't know how to:
- Get accurate offsets of the matched phrases
- Add entity-specific annotations per match (not just tags for every single hit)

I have tried the explain() method, but it only reports which query terms produced the hit, not the offsets of the hit within the original text.

Has anybody faced a similar problem and is willing to share a potential solution?

Thank you in advance for your help!

Kevin Reid
Dima_F

1 Answer

For the offset, see this question: How get the offset of term in Lucene?

I don't quite understand your second question. It sounds to me like you want to get the data from a stored field though. To get the data from a stored field:

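// Run the query and keep the top `num` hits, restricted by `filter`.
TopDocs results = searcher.Search(query, filter, num);
foreach (ScoreDoc result in results.scoreDocs)
{
    // Load the stored document behind this hit.
    Document resultDoc = searcher.Doc(result.doc);
    // Get() returns null if the document has no stored field with this name.
    string valOfField = resultDoc.Get("My Field");
}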
Community
Xodarap
  • The above gets the offset of a single Term; however, I need the offset of the full Phrase that matched my search. As for the stored field, how would I get the data directly from it for each one of the dictionary phrases? – Dima_F Nov 17 '10 at 18:00
  • @Dima_F: I added code to show how to use stored fields. wrt phrase offsets: I don't think you can. You can take a look at what the [highlighter does](http://www.docjar.org/html/api/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilder.java.html), but your best bet might be to modify the highlighter code to return the offset. – Xodarap Nov 17 '10 at 18:45
  • Thank you very much for your help on this! I will let you know where I can get with the Highlighter modification. – Dima_F Nov 17 '10 at 18:49
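
To make the phrase-offset idea from the comments concrete, here is a minimal sketch in plain Java, with no Lucene calls at all. It assumes each token carries its start and end character offsets (Lucene's analysis chain can expose these through `OffsetAttribute`), and it derives a phrase span from the first token's start and the last token's end. The names `PhraseOffsets`, `Token`, `Match`, and `findMatches` are hypothetical, the tokenizer is a simple regex stand-in for a real analyzer, and matching is exact (case-insensitive) rather than fuzzy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of Lucene or its highlighter.
class PhraseOffsets {
    /** A lower-cased token with character offsets into the original text (end is exclusive). */
    static final class Token {
        final String term; final int start; final int end;
        Token(String term, int start, int end) { this.term = term; this.start = start; this.end = end; }
    }

    /** One dictionary hit: the entity label plus the span of the whole phrase. */
    static final class Match {
        final String entityType; final String phrase; final int start; final int end;
        Match(String entityType, String phrase, int start, int end) {
            this.entityType = entityType; this.phrase = phrase; this.start = start; this.end = end;
        }
        @Override public String toString() {
            return entityType + " \"" + phrase + "\" [" + start + ", " + end + ")";
        }
    }

    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Regex stand-in for an analyzer: splits on non-word characters and records offsets. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(new Token(m.group().toLowerCase(Locale.ROOT), m.start(), m.end()));
        }
        return tokens;
    }

    /** dictionary maps phrase -> entity type; every exact occurrence of a phrase is reported. */
    static List<Match> findMatches(String text, Map<String, String> dictionary) {
        List<Token> tokens = tokenize(text);
        List<Match> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            List<Token> phrase = tokenize(entry.getKey());
            if (phrase.isEmpty()) continue;
            for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
                boolean hit = true;
                for (int j = 0; j < phrase.size() && hit; j++) {
                    hit = tokens.get(i + j).term.equals(phrase.get(j).term);
                }
                if (hit) {
                    // Phrase span = start of its first token .. end of its last token.
                    int start = tokens.get(i).start;
                    int end = tokens.get(i + phrase.size() - 1).end;
                    matches.add(new Match(entry.getValue(), text.substring(start, end), start, end));
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("new york", "CITY");
        dictionary.put("apache lucene", "SOFTWARE");
        for (Match m : findMatches("Apache Lucene meetups in New York", dictionary)) {
            System.out.println(m);
        }
    }
}
```

Each `Match` carries its own entity type, so the annotation is per phrase rather than one tag for every hit. The nested loop is quadratic in the worst case; for a large dictionary, a trie or hash lookup over token sequences would likely be a better fit, and fuzzy matching would need a different comparison than `equals`.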