How does the Lucene 4.3.1 highlighter work? I want to print the search results from each document as the matched word followed by the 8 words after it. How can I use the Highlighter class to do that? I have put complete txt, html and xml documents into a folder and added them to my index. Below is my search code, to which I presumably need to add the highlighting:

String index = "index";
String field = "contents";
String queries = null;
int repeat = 1;
boolean raw = true; //not sure what raw really does???
String queryString = null; //keep null, prompt user later for it
int hitsPerPage = 10; //leave it at 10, go from there later

//need to add all files to same directory
index = "C:\\Users\\plib\\Documents\\index";
repeat = 4;


IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);

BufferedReader in = null;
if (queries != null) {
  in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), "UTF-8"));
} else {
  in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
}
QueryParser parser = new QueryParser(Version.LUCENE_43, field, analyzer);
while (true) {
  if (queries == null && queryString == null) {                        // prompt the user
    System.out.println("Enter query. 'quit' = quit: ");
  }

  String line = queryString != null ? queryString : in.readLine();

  if (line == null) {  // readLine() returns null at end of input; length() can never be -1
    break;
  }

  line = line.trim();
  if (line.length() == 0 || line.equalsIgnoreCase("quit")) {
    break;
  }

  Query query = parser.parse(line);
  System.out.println("Searching for: " + query.toString(field));

  if (repeat > 0) {                           // repeat & time as benchmark
    Date start = new Date();
    for (int i = 0; i < repeat; i++) {
      searcher.search(query, null, 100);
    }
    Date end = new Date();
    System.out.println("Time: "+(end.getTime()-start.getTime())+"ms");
  }

  doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);

  if (queryString != null) {
    break;
  }
}
reader.close();

}

Kevin Reid
abitnew
  • I'd try referring to the [documentation](http://lucene.apache.org/core/4_0_0/highlighter/org/apache/lucene/search/highlight/package-summary.html#package_description) and giving it a shot. – femtoRgon Jul 08 '13 at 22:53
  • I read that, but it still didn't make sense. I am a bit confused about where to go with the Highlighter class and its functions. Plus, the documentation is mostly code without much explanation. – abitnew Jul 08 '13 at 22:58

2 Answers

I had the same question, and finally stumbled upon this post:

http://vnarcher.blogspot.ca/2012/04/highlighting-text-with-lucene.html

The key part is that, as you iterate over your results, you call getHighlightedField on the field value that you want to highlight.

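// Needs the lucene-highlighter module; Formatter, Highlighter, QueryScorer, SimpleHTMLFormatter,
// SimpleSpanFragmenter and InvalidTokenOffsetsException are all in org.apache.lucene.search.highlight.
private String getHighlightedField(Query query, Analyzer analyzer, String fieldName, String fieldValue) throws IOException, InvalidTokenOffsetsException {
    // Wrap each matched term in <span class="MatchedText">...</span>.
    Formatter formatter = new SimpleHTMLFormatter("<span class=\"MatchedText\">", "</span>");
    QueryScorer queryScorer = new QueryScorer(query);
    Highlighter highlighter = new Highlighter(formatter, queryScorer);
    // A fragment size of Integer.MAX_VALUE keeps the whole field value as a single fragment.
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(queryScorer, Integer.MAX_VALUE));
    highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
    // Use the analyzer passed in as a parameter; returns null when nothing in fieldValue matches.
    return highlighter.getBestFragment(analyzer, fieldName, fieldValue);
}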

In this case, it assumes the output is going to be HTML, and it simply wraps each highlighted term in a <span> with the CSS class MatchedText. You can then define a custom CSS rule to style the highlighting however you want.
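To make the shape of that output concrete, here is a minimal plain-Java sketch of the wrapping step only. It does not use Lucene and is not how the Highlighter finds matches (Lucene works from the query and the analyzer's tokens). The class `SpanWrapSketch` and the helper `wrapMatches` are hypothetical names made up for this illustration, and the matching is a simple whole-word, case-insensitive regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpanWrapSketch {
    // Hypothetical helper, not part of Lucene: wraps every whole-word,
    // case-insensitive occurrence of term in a MatchedText span.
    // Assumption: text is already safe to emit as HTML (no escaping is done here).
    static String wrapMatches(String text, String term) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Keep the original casing of the matched word inside the span.
            String wrapped = "<span class=\"MatchedText\">" + matcher.group() + "</span>";
            matcher.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Calling `SpanWrapSketch.wrapMatches("Lucene makes search easy", "search")` wraps only the word `search`. A CSS rule such as `.MatchedText { background: yellow; }` would then control how the match is displayed.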

stuckless
  • The link you provided is dead. It seems that the new location for this post is: http://vnarcher.blogspot.fr/2012/04/highlighting-text-with-lucene.html – potame Feb 09 '17 at 14:25
  • Great. For those using dependency management, Gradle, etc, you have to include this line `compile 'org.apache.lucene:lucene-highlighter:[n.n.n]'` in your build file to get hold of the `org.apache.lucene.search.highlight` package. – mike rodent Feb 20 '17 at 19:56

For the Lucene highlighter to work, you need to add two fields to the document you are indexing: one with term vectors enabled and one without. For simplicity, here is a code snippet:

    FieldType type = new FieldType();
    type.setIndexed(true);
    type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    type.setStored(true);
    type.setStoreTermVectors(true);
    type.setTokenized(true);
    type.setStoreTermVectorOffsets(true);
    Field field = new Field("content", "This is fragment. Highlters", type);
    doc.add(field);  //this field has term Vector enabled.

    //without term vector enabled.
    doc.add(new StringField("ncontent","This is fragment. Highlters", Field.Store.YES));

After adding both fields, add the document to your index. To use the Lucene highlighter, use the method below (it uses Lucene 4.2; I have not tested it with Lucene 4.3.1):

public void highLighter() throws IOException, ParseException, InvalidTokenOffsetsException {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("INDEXDIRECTORY")));
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser parser = new QueryParser(Version.LUCENE_42, "content", analyzer);
    Query query = parser.parse("Highlters"); //your search keyword
    TopDocs hits = searcher.search(query, reader.maxDoc());
    System.out.println(hits.totalHits);
    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
    Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
    for (int i = 0; i < hits.scoreDocs.length; i++) { // iterate over the hits, not over every document in the index
        int id = hits.scoreDocs[i].doc;
        Document doc = searcher.doc(id);
        String text = doc.get("ncontent");
        TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "ncontent", analyzer);
        TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
        for (int j = 0; j < frag.length; j++) {
            if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                System.out.println((frag[j].toString()));
            }
        }
        //Term vector
        text = doc.get("content");
        tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), hits.scoreDocs[i].doc, "content", analyzer);
        frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
        for (int j = 0; j < frag.length; j++) {
            if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                System.out.println((frag[j].toString()));
            }
        }

        System.out.println("-------------");
    }
}         
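
The method above prints whole fragments chosen by the highlighter. If the goal is the one from the question, the matched word followed by the next eight words, one option is to cut that window out of the stored text yourself. The following is only a sketch under stated assumptions: it does not use Lucene, it splits on whitespace rather than using the analyzer's tokens, and `SnippetSketch` and `snippetAfter` are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.Locale;

final class SnippetSketch {
    // Hypothetical helper, not part of Lucene: returns the first word equal to
    // term (ignoring case and surrounding punctuation) plus up to wordsAfter
    // following words, or null when the term does not occur.
    static String snippetAfter(String text, String term, int wordsAfter) {
        String[] words = text.trim().split("\\s+");
        String needle = term.toLowerCase(Locale.ROOT);
        for (int i = 0; i < words.length; i++) {
            // Strip leading/trailing characters that are neither letters nor digits,
            // so that "fox," still matches "fox".
            String bare = words[i]
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")
                    .toLowerCase(Locale.ROOT);
            if (bare.equals(needle)) {
                // Clamp the window so it never runs past the end of the text.
                int end = Math.min(words.length, i + 1 + wordsAfter);
                return String.join(" ", Arrays.copyOfRange(words, i, end));
            }
        }
        return null;
    }
}
```

For example, `SnippetSketch.snippetAfter(doc.get("content"), "fox", 8)` would return the first occurrence of `fox` and the eight words after it, assuming `doc` is a hit document with a stored `content` field as in the loop above.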
user1234