I'm using Lucene.NET to build a search index of 10 million+ books. I am using this to index a book:
Document doc = new Document();
doc.Add(new Field("id", bookID, Field.Store.YES, Field.Index.NO));
doc.Add(new Field("publisher", publisherName, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("title", bookTitle, Field.Store.YES, Field.Index.ANALYZED));
Search by Publisher:
Since I've indexed book publishers with Index.NOT_ANALYZED, I can use the high-performance TermsFilter to do the equivalent of:
SELECT * FROM books WHERE publisher="O'Reilly Media"
Search by Title:
And of course, since I've indexed the book titles using the Index.ANALYZED option, I can use the standard QueryParser to do the equivalent of:
SELECT * FROM books WHERE title LIKE "%skating%"
Search by Author:
However, now I also need to search by author. I need something like:
SELECT * FROM books WHERE title LIKE "%skating%" AND authors CONTAIN "Jack Black"
So how do I go about doing that? I have both author names and author IDs stored per book. How can I index that into the Lucene Document and then quickly search for all books by author? I don't want to use SQL since I need to combine the search keywords with the author filter, so Lucene must do the author filtering for me.
The most obvious solution is:
doc.Add(new Field("authors", "Jack Black; Joan White", Field.Store.YES, Field.Index.ANALYZED));
But this would incorrectly return books where one author's name is contained within (or merely similar to) another author's name, e.g.:
- Book 1 : Authors : Jack D Black, Bob A Smith
- Book 2 : Authors : D Black
Here, searching for "D Black" would incorrectly return both Book 1 and Book 2, instead of just Book 2. I therefore need to index the whole author name or ID (using Index.NOT_ANALYZED), but I need several such fields per book. Is this possible?
// can I add the same field multiple times into a document?
doc.Add(new Field("author", "Jack D Black", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("author", "Bob A Smith", Field.Store.YES, Field.Index.NOT_ANALYZED));
Or I could add the author IDs, such that the analyzer takes each number as an independent word:
doc.Add(new Field("authors", "125;1885;23", Field.Store.YES, Field.Index.ANALYZED));
And then use a regular Lucene search to find all books with the author "125"... Would this work, or would it also match books with the author "1254"?
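To make the matching behaviour I'm asking about concrete, here is a small stand-alone toy model. It does not use Lucene.NET at all: ToyTokenize is a hypothetical stand-in for an analyzer that lowercases and splits on anything that is not a letter or digit, and PhraseMatches and ExactMatches are my own illustrative helpers. Whether the real analyzers split "125;1885;23" the same way is exactly the assumption I'm unsure about.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Toy model only: none of these types or methods come from Lucene.NET.
public static class AuthorMatchSketch
{
    // Hypothetical analyzer stand-in: lowercase, split on non-alphanumerics.
    public static List<string> ToyTokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0) tokens.Add(current.ToString());
        return tokens;
    }

    // Analyzed-field behaviour: the query tokens must appear consecutively
    // somewhere in the field's token stream.
    public static bool PhraseMatches(string fieldValue, string phrase)
    {
        List<string> field = ToyTokenize(fieldValue);
        List<string> query = ToyTokenize(phrase);
        if (query.Count == 0) return false;
        for (int i = 0; i + query.Count <= field.Count; i++)
        {
            if (field.Skip(i).Take(query.Count).SequenceEqual(query)) return true;
        }
        return false;
    }

    // Un-analyzed, multi-valued behaviour: one of the stored values must
    // equal the wanted value exactly.
    public static bool ExactMatches(IEnumerable<string> values, string wanted)
    {
        return values.Contains(wanted);
    }

    public static void Main()
    {
        // Analyzed "authors" field: Book 1 matches "D Black" by accident.
        Console.WriteLine(PhraseMatches("Jack D Black; Bob A Smith", "D Black")); // True
        Console.WriteLine(PhraseMatches("D Black", "D Black"));                   // True

        // One un-analyzed "author" value per author: only Book 2 matches.
        Console.WriteLine(ExactMatches(new[] { "Jack D Black", "Bob A Smith" }, "D Black")); // False
        Console.WriteLine(ExactMatches(new[] { "D Black" }, "D Black"));                     // True

        // Author IDs, assuming the analyzer really splits on ';'.
        Console.WriteLine(PhraseMatches("125;1885;23", "125"));  // True
        Console.WriteLine(PhraseMatches("1254;1885;23", "125")); // False
    }
}
```

In this toy model the ID approach only avoids the "1254" false match because whole tokens are compared, so everything hinges on how the real analyzer tokenizes the ID string.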