We have two sites, one developed in Ruby on Rails and the other in Python (Django). MongoDB is used as the data store for both. The sites are login-based, so users can see only their own data, not other users'. There are also many models in MongoDB, and these models are inter-related.
We have to develop a search feature similar to Gmail search. The Gmail search box supports field prefixes such as `label:`, `to:`, `from:`, and `attachment:` for filtering; if none of these fields is used, a plain keyword search is performed. What is astonishing is that Gmail fetches results for any search query in under one second, even on a 256 kbps connection.
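For reference, the field-prefix syntax described above can be parsed with a few lines of Python. This is only a sketch: the field names mirror Gmail's operators, and the function name is our own, not part of any library.

```python
import re

# Operators to recognize, mirroring Gmail's label:/to:/from:/attachment: syntax.
FIELDS = {"label", "to", "from", "attachment"}

def parse_query(query):
    """Split a Gmail-style query into field filters and free-text terms."""
    filters = {}
    terms = []
    for token in query.split():
        match = re.match(r"^(\w+):(.+)$", token)
        if match and match.group(1) in FIELDS:
            filters[match.group(1)] = match.group(2)
        else:
            terms.append(token)
    return filters, terms

# Field filters are extracted; everything else stays a plain keyword.
print(parse_query("from:alice attachment:pdf quarterly report"))
# → ({'from': 'alice', 'attachment': 'pdf'}, ['quarterly', 'report'])
```

The filters can then be translated into per-field conditions on the query sent to the backend, with the remaining terms used for the keyword search.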
Searching for a keyword by issuing separate queries against every model is not feasible, so we searched Google for solutions on crawling and indexing database data.
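One common alternative to querying every model at search time is to build a single inverted index offline: a background job walks each collection and maps every keyword to the documents that contain it, so a search becomes one lookup instead of N model queries. A minimal in-memory sketch of the idea (the model names and records here are made up):

```python
from collections import defaultdict

# keyword -> set of (model, doc_id) pairs. A real deployment would persist
# this (e.g. in its own collection) and rebuild it incrementally.
index = defaultdict(set)

def index_document(model, doc_id, text):
    """Add one record's text to the shared index, one keyword at a time."""
    for word in text.lower().split():
        index[word].add((model, doc_id))

def search(keyword):
    """One dictionary lookup returns matches across all models at once."""
    return index.get(keyword.lower(), set())

# Hypothetical records from two different models:
index_document("Invoice", 1, "Quarterly report for ACME")
index_document("Message", 7, "Please review the quarterly numbers")

print(sorted(search("quarterly")))
# → [('Invoice', 1), ('Message', 7)]
```

Dedicated engines like Solr and Sphinx maintain essentially this structure (plus ranking, stemming, and faceting) on disk; the point is that the index is built ahead of time, not computed per query.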
A Google search for "search engine" turned up results about crawling and indexing web pages. The tools mentioned are Lucene/Solr + Nutch and Sphinx, but that stack is aimed at web pages: Nutch crawls pages and stores the keywords, and Solr indexes those keywords and searches them.
Googling "database search engine" doesn't turn up any concrete results either.
In this link, the second point states that MongoDB and similar stores seem to serve cases where there is no requirement for searching and/or faceting. Does that mean crawling and indexing MongoDB is not feasible?
In a general sense, is there anything like crawling and indexing for databases, irrespective of the database tool (MySQL, SQLite, PostgreSQL, MongoDB, etc.)?
Update:
The sites we have developed are very similar to Gmail, except that they are not mail services; we just need to develop a search feature. Just as Gmail users can see their own mail and nobody else's, content on our sites is specific to each user. Hope that clarifies the problem.
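Because results must be scoped per user (as in Gmail), the owning user can be stored alongside each indexed document and filtered out at query time. A sketch extending the shared-index idea, with illustrative names only:

```python
from collections import defaultdict

# keyword -> set of (user_id, model, doc_id). Every posting carries its owner,
# so a single shared index still returns only the searching user's documents.
index = defaultdict(set)

def index_document(user_id, model, doc_id, text):
    for word in text.lower().split():
        index[word].add((user_id, model, doc_id))

def search(user_id, keyword):
    # Restrict postings to the logged-in user before returning anything.
    return {(m, d) for (u, m, d) in index.get(keyword.lower(), set())
            if u == user_id}

index_document(42, "Message", 1, "project status report")
index_document(99, "Message", 2, "project kickoff notes")

print(search(42, "project"))  # only user 42's documents
# → {('Message', 1)}
```

The same per-user scoping works with an external engine: each indexed document gets a `user_id` field, and every query adds a mandatory filter on it.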