
In a project I am working on, the client has an old and massive (terabyte-range) RDBMS. Queries of all kinds are slow and there is no time to fix or refactor the schema. I've identified the set of common queries that need to be optimized. This set is divided in two: full-text queries and metadata queries.

My plan is to extract the data from their database and partition it across two different storage systems, each optimized for a particular query set.

For full-text search, Solr is the engine that makes the most sense. Its sharding and replication features make it a great fit for half of the problem.

For metadata queries, I am not sure what route to take. Currently, I'm thinking of using an RDBMS with an extremely denormalized schema that represents a particular subset of the data from the "authoritative" RDBMS. However, my client is concerned about the lack of sharding and replication in such a subsystem, and the difficulty/complications of setting up those features compared with Solr, which already includes them. Metadata in this case takes the form of integers, dates, bools, bits, and strings (with a max size of 10 chars).

Is there a database storage system featuring built-in sharding and replication that may be particularly useful for querying said metadata? Maybe a NoSQL solution out there that provides a good query engine?

Illuminate please.

Additions/Responses:

Solr can be used for metadata; however, the metadata is volatile, so I would have to commit to the indexes often. This would cause search performance to degrade pretty fast.
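One mitigation worth exploring is batching: buffer the volatile updates and commit on a schedule instead of per document. Below is a minimal sketch, assuming a Python indexing job with the pysolr client; the URL, batch size, and interval are hypothetical tuning knobs, not measured values.

import time

import pysolr

# Hypothetical core URL; adjust to the real Solr deployment.
solr = pysolr.Solr("http://localhost:8983/solr/metadata", timeout=10)

BATCH_SIZE = 500        # made-up knob: docs buffered before a commit
COMMIT_INTERVAL = 60.0  # made-up knob: max seconds between commits

buffer = []
last_commit = time.time()

def queue_update(doc):
    """Buffer a volatile metadata update instead of committing it alone."""
    buffer.append(doc)
    if len(buffer) >= BATCH_SIZE or time.time() - last_commit >= COMMIT_INTERVAL:
        flush()

def flush():
    """Send one batch and reopen the searcher once, not once per update."""
    global last_commit
    if buffer:
        solr.add(buffer, commit=True)
        del buffer[:]
    last_commit = time.time()

The trade-off is staleness: metadata changes become searchable only at the next flush, which is the usual price of keeping Lucene commits infrequent.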

Newbie
  • How much data is the metadata? – Charles Lambert May 19 '11 at 20:52
  • @Charles, excellent question: the metadata is on the order of a couple hundred gigabytes and less than a terabyte of storage; roughly speaking, 500GB-1000GB at any given point in time. This is because I intend to enforce some kind of archiving policy to draw the line between real-time search and job-based search (e.g. "Your search will take some time to process; come back in a few minutes to see your results"). In the context of this question, of course, I'm addressing the near-real-time use case. – Newbie May 20 '11 at 08:48
  • Ah, is there a particular reason why you don't want to do the metadata queries in Solr as well? It's quite capable of handling all those other data types too. – Femi May 13 '11 at 04:52
  • Solr can be used for metadata; however, the metadata is volatile, so I would have to commit to the indexes often, which would degrade search pretty fast. Umm, maybe some kind of index management strategy that mitigates this problem could yield the desired results? I'll think about this. – Newbie May 13 '11 at 05:01
  • Ah. I was under the impression the db was mostly historical and static, not rapidly changing. You are starting to wander into distributed search land: I expect you'll have to roll your own index management on top of some other solution (or pay someone else to roll it for you if you have the budget :)) – Femi May 13 '11 at 05:14

4 Answers


RavenDB:

Cons: it's AGPL-licensed. Depending on your dev/server environment, you could consider its running on .NET a con. I'm also not aware of the status of clients for platforms other than .NET.

Solandra:

  • Integrates Solr and Cassandra
  • Full-text search managed by Solr
  • Replication and sharding managed by Cassandra

Cons: not yet released.

ElasticSearch:

ElasticSearch looks similar to RavenDB, but it seems to emphasize full-text search, whereas RavenDB emphasizes being a general-purpose NoSQL database.

Mauricio Scheffer
  • Do these databases manage Lucene indexes automatically (in the background)? (e.g. is index fragmentation something you still need to code against?) If search, for these databases, relies on Lucene, wouldn't I be better off having two distinct Solr deployments tailored to both of my query needs? I'm a bit confused about the additional value gained from using your recommended databases over Solr, which I would end up using anyway. Thanks for your help! – Newbie May 16 '11 at 04:29
  • @Newbie: I have no first-hand experience with either of these databases (I do have experience with Solr, though), but they claim to have good support for near-real-time search (your last requirement). – Mauricio Scheffer May 17 '11 at 01:05
  • @Newbie: added ElasticSearch; it also claims to do near-real-time search. – Mauricio Scheffer May 17 '11 at 01:19
  • 2
    @Newbie; RavenDb processes its Lucene indexes on background workers which provides the notion of Eventual Consistency which means that they might be stale while still providing results. You can, however, write Map/Reduce indexes and with a feature called Live projections you can Map out a subset of the data and the project that data into an index and also combine data from several document types. – Mikael Östberg May 17 '11 at 07:32

Use MongoDB for your metadata store. The downside is that you cannot perform joins, so be smart about denormalizing your data to avoid needing them.
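As a rough illustration of the denormalization idea, here is a minimal pymongo sketch; the database, collection, and field names are all invented for the example. The lookup-table values are embedded directly in each document so the common metadata queries never need a join:

from datetime import datetime

from pymongo import ASCENDING, MongoClient

client = MongoClient()          # assumes a local mongod
docs = client.metadata_db.docs  # hypothetical database/collection

# Denormalize: embed values that would otherwise live in joined tables.
docs.insert_one({
    "doc_id": 42,
    "created": datetime(2011, 5, 13),
    "archived": False,
    "flags": 0b0010,                # bits stored as a small integer
    "tags": ["invoice", "q2"],      # embedded instead of a join table
    "owner_name": "acme",           # copied in, not referenced
})

# Index the fields the common metadata queries filter on.
docs.create_index([("archived", ASCENDING), ("created", ASCENDING)])

# A typical metadata query: equality plus a date range, no joins.
recent = docs.find({
    "archived": False,
    "created": {"$gte": datetime(2011, 1, 1)},
    "tags": "invoice",
})

With the data shaped this way, MongoDB's auto-sharding and replication come along for free, which addresses the sharding/replication concern from the question.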

alan

I'm sure you're aware that you are not going to get fast query times on any system that has frequent updates. To implement sharding yourself against an RDBMS, you would need to find some key to split the records on and populate multiple databases. Then you could query them all at the same time and process the results in a map/reduce fashion (see the sketch below). This would allow you to grow the number of machines as your data grows, and possibly increase the speed of the operation. From a quick Google search, both MongoDB and Hadoop provide this map/reduce functionality, though I am unfamiliar with both.
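A self-contained sketch of that idea, using in-memory sqlite3 databases as stand-ins for separate shard servers; the shard count, schema, and split key are invented for illustration:

import sqlite3
from concurrent.futures import ThreadPoolExecutor

N_SHARDS = 4  # invented; grow this as the data grows

# Stand-ins for separate database servers, one connection per shard.
shards = [sqlite3.connect(":memory:", check_same_thread=False)
          for _ in range(N_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE meta (doc_id INTEGER, year INTEGER)")

def shard_for(doc_id):
    """Pick a shard by hashing the split key."""
    return shards[hash(doc_id) % N_SHARDS]

def insert(doc_id, year):
    shard_for(doc_id).execute(
        "INSERT INTO meta VALUES (?, ?)", (doc_id, year))

def query_all(year):
    """Map: run the same query on every shard in parallel, then reduce."""
    def one_shard(db):
        return db.execute(
            "SELECT doc_id FROM meta WHERE year = ?", (year,)).fetchall()
    with ThreadPoolExecutor(max_workers=N_SHARDS) as pool:
        partials = list(pool.map(one_shard, shards))
    # Reduce: merge the per-shard partial results.
    return [row for part in partials for row in part]

for i in range(100):
    insert(i, 2010 + i % 3)
print(len(query_all(2011)))  # rows gathered from all four shards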

It is not uncommon for complex, long-running reports to be generated on the fly. However, this is usually accompanied by an email notification when the report has finished generating; that makes for a good push-notification format for interfacing with humans. Also, if these reports are expected on a cyclical basis (e.g. weekly, monthly, etc.), you can still use the email notification when the reports are ready; the only difference is that the kick-off time for generation is automated.

Charles Lambert

If you use ElasticSearch, you can simply add the metadata as extra keys in the JSON document:

{
    "message": ... your full text,
    "date": "2009-11-15T14:12:12",
    ...
}

Then you can search with both at the same time (see the sketch below). Otherwise, if you still want to take the two-system approach, MongoDB is a document store with auto-sharding that has some pretty advanced query mechanisms (fields, map/reduce, indexes for fast queries).
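To make the combined search concrete, here is a minimal sketch assuming the official elasticsearch-py client; the index name, field names, and query values are hypothetical:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a node on localhost:9200

# One request combines the full-text match with metadata filters.
results = es.search(index="docs", body={
    "query": {
        "bool": {
            "must": {"match": {"message": "overdue invoice"}},
            "filter": [
                {"term": {"archived": False}},
                {"range": {"date": {"gte": "2009-01-01"}}},
            ],
        }
    }
})

for hit in results["hits"]["hits"]:
    print(hit["_source"]["date"], hit["_score"])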

Bashwork