I don't want to use Lucene because I think it is too heavy.
Is there an easier way to implement this (for millions of records)?
If you don't want to have to worry about performance, I recommend you take a look at Amazon Web Services' new CloudSearch service. It's fast and scales as your needs scale. It can also handle millions of documents without a problem and supports wildcard searches (e.g. quo* would retrieve Quora).
Check it out here.
Obviously this isn't necessarily how it works at either Quora or Google, as I haven't had the pleasure of working at either; this is just how I'd go about doing it.
The first thing to obtain is a list of search terms - I'm assuming you don't want to know how this is done, as it will really depend on all sorts of things, but basically you're either going to do a select distinct title from pages (in the case of the autocomplete on Wikipedia) or something much more advanced in the case of Google's.
The next step is also pretty simple at a high level: you need to perform the query select title from titles where title like 'Qu%' in the case of the user typing Qu into the search box. The list of titles is then returned to the browser as the response to some kind of Ajax request, perhaps in the form of JSON or similar. And you need to do it as fast as possible - that's where it becomes difficult.
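As a rough illustration of the prefix query above, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table layout and the function names (build_title_index, suggest) are my own assumptions for the example, not anything Quora or Google is known to use. Note that SQLite's LIKE is case-insensitive for ASCII by default, and that the user's input needs its % and _ characters escaped so they are matched literally.

```python
import sqlite3


def build_title_index(titles):
    """Load titles into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE titles (title TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO titles (title) VALUES (?)",
        ((t,) for t in titles),
    )
    conn.commit()
    return conn


def suggest(conn, prefix, limit=10):
    """Return up to `limit` titles that start with `prefix`."""
    # Escape LIKE wildcards so '%' and '_' typed by the user match literally.
    escaped = (
        prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT title FROM titles WHERE title LIKE ? ESCAPE '\\' "
        "ORDER BY title LIMIT ?",
        (escaped + "%", limit),
    )
    return [row[0] for row in rows]
```

A controller handling the Ajax request would call something like suggest(conn, "Qu") and serialize the returned list as JSON.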
How do they do it so quickly? There are probably four things to bear in mind.
So, that's how I would do it. Not having had the experience of the enormous datasets Google/Quora are dealing with, I'm sure there are things that I've not considered. But, it's a start.
And, here's how I have done it, purely in an experimental environment at home:
I had a simple list of a good few hundred thousand titles to search. These were loaded into a dedicated MongoDB collection, which had a single index defined on it. I then had a Play Framework controller in front of it and used jQuery's autocomplete plugin to do the search.
Obviously this is tiny compared with what you are looking for, but MongoDB should provide the same kind of performance for your dataset provided you follow the recommendations (i.e. good hardware, lots of RAM, keep the indexes in memory). In addition, Mongo supports sharding, and the Play Framework is shared-nothing, so adding new machines to cope with the load as your userbase grows would be straightforward in this situation.
By the way, Mongo is by no means the only solution; traditional SQL databases will be up to the job too, of course. I was just using Mongo for other reasons.
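For the MongoDB setup described above, one common way to express a prefix search is an anchored regular expression, which MongoDB can serve from an index on the field when the pattern is a case-sensitive prefix. The sketch below only builds the filter document; the helper name prefix_filter and the field name title are assumptions for illustration, and this is not necessarily the exact query used in the experiment.

```python
import re


def prefix_filter(prefix, field="title"):
    """Build a MongoDB filter matching documents whose field starts with prefix.

    The prefix is escaped so characters such as '+' or '.' are treated
    literally, and the pattern is anchored with '^' so it stays a prefix match.
    """
    return {field: {"$regex": "^" + re.escape(prefix)}}
```

With the pymongo driver and a collection object (both assumed here), the filter would be used along the lines of collection.find(prefix_filter("Qu")).limit(10).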
First, for autocomplete you should aim to get the response back to the user in <= 100ms if you want something that appears fast. That should be your first concern. Any setup that can't do that probably won't be good enough for users. In my own tests in Firefox using Firebug, Google's autocomplete returned results in about 50ms and Quora's in about 65ms.
See, e.g.
Apparently, Quora uses prefix matching, not full-text search, which makes it faster. To roll your own fast prefix-based autocomplete, try an in-memory data store like Redis. This should be sufficient for many cases, but it won't handle things like misspellings via fuzzy matching. The details can be seen here:
http://charlesleifer.com/blog/powerful-autocomplete-with-redis-in-under-200-lines-of-python/
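To show why in-memory prefix matching is cheap, here is a minimal sketch that keeps the titles in a sorted Python list and uses binary search to find the first candidate. This is a plain in-process stand-in, not the Redis approach from the linked post, and the class name PrefixIndex is made up for the example. Because all keys sharing a prefix sit next to each other in sorted order, a lookup costs one binary search plus a short scan.

```python
import bisect


class PrefixIndex:
    """Case-insensitive prefix lookup over a sorted in-memory list."""

    def __init__(self, titles):
        # Store (lowercased key, original title) pairs, sorted by key.
        self._entries = sorted((t.lower(), t) for t in set(titles))
        self._keys = [key for key, _ in self._entries]

    def suggest(self, prefix, limit=10):
        """Return up to `limit` titles whose lowercased form starts with prefix."""
        p = prefix.lower()
        # First position whose key is >= the prefix; matches are contiguous from here.
        start = bisect.bisect_left(self._keys, p)
        results = []
        for key, title in self._entries[start:start + limit]:
            if not key.startswith(p):
                break
            results.append(title)
        return results
```

Like the prefix matching described above, this does nothing for misspellings; fuzzy matching would need a different structure.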
I haven't been able to get CloudSearch down to the low latencies of Google and Quora I cited, regardless of how simple the search query is. I measured 95-125ms in the browser when fetching from the endpoint directly (as reported by Firebug), and 20-30ms more when accessing the endpoint via cURL in PHP. An Elasticsearch cluster is a bit faster. These statements obviously depend on the use case and probably don't generalize well, but they are something to think about.