
I would like to find out how Lucene search works so fast. I can't find any useful docs on the web. If you have anything (short of the Lucene source code) to read, let me know.

A text search query using MySQL 5 full-text search with an index takes about 18 minutes in my case. A Lucene search for the same query takes less than a second.

jan
Midhat

4 Answers


Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since the index is an exact, unordered string match, it can be extremely fast. Hypothetically, an unordered SQL index on a varchar field could be just as fast, and in fact I think you'll find the big databases can do a simple string-equality query very quickly in that case.
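
A minimal sketch of that idea in plain Java (a toy illustration only; the class and method names here are made up, and Lucene's real index adds sorted term dictionaries, compression, and positions):

```java
import java.util.*;

// Toy inverted index: each word maps to the set of document IDs that
// contain it. A single-word query is then one hash lookup, independent
// of how many documents have been indexed.
public class ToyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            postings.computeIfAbsent(word, w -> new HashSet<>()).add(docId);
        }
    }

    public Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(1, "Lucene is an inverted full-text index");
        index.addDocument(2, "MySQL scans rows unless an index applies");
        System.out.println(index.search("index"));  // docs 1 and 2
        System.out.println(index.search("lucene")); // doc 1 only
    }
}
```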

Lucene does not have to optimize for transaction processing. When you add a document, it need not ensure that queries see it instantly. And it need not optimize for updates to existing documents.

However, at the end of the day, if you really want to know, you need to read the source. Both things you reference are open source, after all.

bmargulies
  • If I understand correctly, the thing that sets text search engines apart is how they handle multi-word searches and joining the results of searches to multiple indexes in real time. I would not suggest consulting Lucene source for this. It would probably be better to read a little about text search theory, @alienCoder's answer helped me. – Chris Dutrow Apr 20 '14 at 01:18
  • @bmargulies, If the indexing is "per word", then why does the stackoverflow user search http://stackoverflow.com/users allow substring matches? – Pacerier Dec 07 '14 at 12:53
  • This is not the place for whole-book answers. There are any number of elaborations on the basic concept in there. – bmargulies Dec 07 '14 at 13:04
  • What do you mean "an index for each word"...if I start typing "abc", how is it going to find "abc" in the document? – Alexander Mills Mar 11 '19 at 20:28
  • An index (B-tree) from word to document can find documents by the words they contain, because the index table is (word, document) with the index on the word column. Consider a query like "find documents with the words 'police', 'crime', and 'statistics' in them". By searching the word index, you can do three log(N) lookups to get O(N) documents containing one of those words. Then you can do two O(N) passes to build the set of documents that contain all three words. Although this is theoretically an O(N) operation, most documents don't contain all three words, so it's O(n) where n < N. – Calicoder Jun 24 '20 at 19:06

Lucene creates one big inverted index. For each word, the index stores the word's ID, the number of documents containing it, and its positions within those documents. A single-word query is then just an index lookup (effectively O(1)), and the results are ranked using one of several scoring algorithms. For a multi-word query, it simply takes the intersection of the sets of files in which each word appears. That is why Lucene is so fast.
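
A sketch of that multi-word intersection step in plain Java (a toy illustration; Lucene actually merges sorted, compressed posting lists with skip pointers, but the principle is the same):

```java
import java.util.*;

// Intersect two sorted posting lists to answer a two-word AND query.
// Walking the sorted lists in lockstep costs O(m + n), rather than
// the O(m * n) of a naive nested loop.
public class PostingIntersection {
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { result.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical posting lists: sorted doc IDs containing each word.
        List<Integer> police = List.of(1, 4, 7, 9);
        List<Integer> crime  = List.of(2, 4, 9, 12);
        System.out.println(intersect(police, crime)); // [4, 9]
    }
}
```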

For more info, read this article by Google's founders: http://infolab.stanford.edu/~backrub/google.html

devdanke
alienCoder
  • Skimmed over that paper; it was pretty helpful. Section "4.5 Searching" had the answer I was looking for: it sounds like an O(1) hash search is used for individual words, but then an O(n) scan is used to join the results, with a 40,000-document limit. I assume a map-reduce algorithm is used to split this work up so that the user gets instantaneous results. – Chris Dutrow Apr 20 '14 at 01:19
  • One popular ranking algorithm is PageRank, although I don't know much about it. – alienCoder Apr 20 '14 at 14:16
  • That paper is amusing: "In this paper, we present Google, a prototype...". I guess Google wasn't always a mega-corporation. – Buttons840 Jul 18 '14 at 01:17
  • I don't know Lucene, but one question: does ranking happen at each search, or are the documents kept pre-ranked? If it ranks the documents in advance, how does it handle multi-word queries? – Vikas Prasad Aug 29 '17 at 14:33
  • The link is broken now. @alienCoder – CEGRD Aug 25 '20 at 14:26

In a word: indexing.

Lucene creates an index of your document that allows it to search much more quickly.

It's the same difference as between a list, an O(N) data structure, and a hash table, an O(1) data structure. The list has to walk through the entire collection to find what you want. The hash table has an index that lets it figure out exactly where the desired item is and fetch it directly.
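
To make the analogy concrete, here is a toy Java comparison (timings are rough and machine-dependent):

```java
import java.util.*;

// Compare membership lookup in a List (linear scan) versus a HashSet
// (hash lookup). With a million elements the difference is dramatic.
public class LookupComparison {
    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> list = new ArrayList<>(n);
        Set<Integer> set = new HashSet<>(n);
        for (int i = 0; i < n; i++) { list.add(i); set.add(i); }

        long t0 = System.nanoTime();
        boolean inList = list.contains(n - 1);   // walks ~n elements
        long t1 = System.nanoTime();
        boolean inSet = set.contains(n - 1);     // one bucket probe
        long t2 = System.nanoTime();

        System.out.printf("list: %b in %,d ns%n", inList, t1 - t0);
        System.out.printf("set:  %b in %,d ns%n", inSet, t2 - t1);
    }
}
```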

Update:

I'm not certain what you mean by "Lucene index searches are a lot faster than MySQL index searches."

My guess is that you're using MySQL's "WHERE document LIKE '%phrase%'" to search for a document. If so, MySQL has to do a full table scan, checking every row, which is O(N).

Lucene gets to parse the document into tokens, group them into n-grams at your direction, and build index entries for each of those. It's O(1) to find a word in an indexed Lucene document.
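
For reference, basic indexing and searching with Lucene itself looks roughly like this (a sketch written against the Lucene 8.x/9.x API; the "body" field name is arbitrary, and exact method signatures shift between major versions, so check the docs for your release):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory(); // in-memory, for demo

        // Indexing: the analyzer tokenizes the text, and Lucene builds
        // the inverted index as documents are added.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Random Access Memory is the main memory.", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Searching: query terms are looked up in the index; no document
        // text is scanned at query time.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("main memory");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```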

duffymo
  • Yes, I understand the indexing part, but again, Lucene index searches are a lot faster than MySQL index searches. How does that happen? – Midhat Apr 24 '10 at 19:19

Lucene works with term frequency and inverse document frequency. It creates an index mapping each word to the documents containing it, along with the word's frequency count; in other words, an inverted index over the documents.

Example:

File 1: Random Access Memory is the main memory.

File 2: Hard disks are secondary memory.

Lucene creates an inverted index, something like:

File 1:

Term: Random
Frequency: 1
Position: 0

Term: Memory
Frequency: 2
Positions: 2, 6

So it is able to search and retrieve the matching content quickly. When there are too many matches for the search query, it ranks the results by weight. Consider the search query "Main Memory": it searches for both words individually, and the result would look like:

Main

File 1: Frequency 1

Memory

File 1: Frequency 2

File 2: Frequency 1

The result would be File 1 followed by File 2. To keep the most common words like 'and', 'or', and 'the' from dominating the weights, it factors in the inverse document frequency (i.e., it decreases the weight of a word that appears across most of the document set).
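
A toy version of that weighting in Java, ranking the two example files for the query "Main Memory" (the idf formula here is a smoothed variant in the spirit of Lucene's classic TF-IDF similarity; recent Lucene versions actually default to BM25):

```java
import java.util.*;

// Toy TF-IDF ranking: score(doc) = sum over query terms of tf * idf,
// where tf is the term's count in the document and idf discounts
// terms that appear in most documents.
public class TfIdfSketch {
    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("random", "access", "memory", "is", "the", "main", "memory"),
            List.of("hard", "disks", "are", "secondary", "memory"));
        List<String> query = List.of("main", "memory");

        int n = docs.size();
        for (int i = 0; i < n; i++) {
            double score = 0;
            for (String term : query) {
                long tf = docs.get(i).stream().filter(term::equals).count();
                long df = docs.stream().filter(d -> d.contains(term)).count();
                double idf = 1 + Math.log((double) n / (df + 1));
                score += tf * idf; // frequent-but-rare terms score highest
            }
            System.out.printf("File %d: score %.3f%n", i + 1, score);
        }
        // Prints File 1 ahead of File 2, matching the ranking above.
    }
}
```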

Tom Taylor