I'm using Lucene.NET to build a search index of 10 million+ books. I am using this to index a book:

Document doc = new Document();

doc.Add(new Field("id", bookID, Field.Store.YES, Field.Index.NO));

doc.Add(new Field("publisher", publisherName, Field.Store.YES, Field.Index.NOT_ANALYZED));

doc.Add(new Field("title", bookTitle, Field.Store.YES, Field.Index.ANALYZED));

Search by Publisher:

Since I've indexed book publishers with Index.NOT_ANALYZED, I can use the high-performance TermsFilter to do the equivalent of:

SELECT * FROM books WHERE publisher="O'Reilly Media"

Search by Title:

And of course since I've indexed the book titles using the Index.ANALYZED option, I can use the standard QueryParser to do the equivalent of:

SELECT * FROM books WHERE title LIKE "%skating%"

Search by Author:

However, I now need to search by author as well. I need something like:

SELECT * FROM books WHERE title LIKE "%skating%" AND authors CONTAIN "Jack Black"

So how do I go about doing that? I have both author names and author IDs stored per book. How can I index that into the Lucene Document and then quickly search for all books by author? I don't want to use SQL since I need to combine the search keywords with the author filter, so Lucene must do the author filtering for me.

The most obvious solution is:

doc.Add(new Field("authors", "Jack Black; Joan White", Field.Store.YES, Field.Index.ANALYZED));

But this would incorrectly return books where the name of one author is contained within the name of another author, e.g.:

  • Book 1 : Authors : Jack D Black, Bob A Smith
  • Book 2 : Authors : D Black

At this point, searching for "D Black" would incorrectly return both Book 1 and Book 2, instead of just Book 2. I therefore need to index the whole author name or ID (using Index.NOT_ANALYZED), but I need multiple such fields per book. Is this possible?

// can I add the same field multiple times into a document?
doc.Add(new Field("author", "Jack D Black", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("author", "Bob A Smith", Field.Store.YES, Field.Index.NOT_ANALYZED));

Or I could add the author IDs, such that the analyzer takes each number as an independent word:

doc.Add(new Field("authors", "125;1885;23", Field.Store.YES, Field.Index.ANALYZED));

And then use a regular Lucene search to find all books with the author "125"... Would this work or would this also list books with the author "1254"?

Robin Rodricks

1 Answer

Thanks to Lucas, I figured out that you can add the same field multiple times during indexing:

foreach (string author in authors){
   doc.Add(new Field("author", author, Field.Store.YES, Field.Index.NOT_ANALYZED));
}

This allows you to use the high-performance TermsFilter for searching exact matches.
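To make the combined search concrete, here is a minimal sketch that indexes three made-up books with a repeated, unanalyzed "author" field and then runs a title keyword query restricted by an exact author match. It assumes Lucene.NET 3.0.3 and that TermsFilter is available from the contrib Queries assembly; the class name AuthorFilterSketch, the helper MakeBook, the in-memory RAMDirectory, and the sample data are all hypothetical and only there for illustration.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using LuceneVersion = Lucene.Net.Util.Version;

public static class AuthorFilterSketch
{
    // Hypothetical helper: one document per book, one "author" field per author.
    static Document MakeBook(string id, string title, params string[] authors)
    {
        var doc = new Document();
        doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NO));
        doc.Add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        foreach (string author in authors)
        {
            // NOT_ANALYZED keeps the whole name as a single term.
            doc.Add(new Field("author", author, Field.Store.YES, Field.Index.NOT_ANALYZED));
        }
        return doc;
    }

    // Returns the stored ids of books whose title matches the keywords
    // and that have exactAuthor as one of their author values.
    public static List<string> Search(string titleKeywords, string exactAuthor)
    {
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_30);
        var dir = new RAMDirectory(); // in-memory index, for the sketch only

        using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            writer.AddDocument(MakeBook("1", "Figure Skating Basics", "Jack D Black", "Bob A Smith"));
            writer.AddDocument(MakeBook("2", "Speed Skating", "D Black"));
            writer.AddDocument(MakeBook("3", "Cooking at Home", "D Black"));
        }

        // Keyword part: analyzed title search via the standard QueryParser.
        Query titleQuery = new QueryParser(LuceneVersion.LUCENE_30, "title", analyzer)
            .Parse(titleKeywords);

        // Filter part: exact match on any one of the "author" values.
        var authorFilter = new TermsFilter();
        authorFilter.AddTerm(new Term("author", exactAuthor));

        var ids = new List<string>();
        using (var searcher = new IndexSearcher(dir, true))
        {
            TopDocs hits = searcher.Search(titleQuery, authorFilter, 10);
            foreach (ScoreDoc hit in hits.ScoreDocs)
            {
                ids.Add(searcher.Doc(hit.Doc).Get("id"));
            }
        }
        return ids;
    }
}
```

Calling AuthorFilterSketch.Search("skating", "D Black") should return only book 2 here: book 1 matches the title but its author terms are "Jack D Black" and "Bob A Smith", and book 3 has the right author but no matching title. Because the field is not analyzed, the match is case-sensitive and must equal the indexed value exactly, so indexing author IDs instead of names may be the safer choice.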

Robin Rodricks