What database to use for 15M records and a fast search index

Question

So this is something I have been wondering about after working on a project for about nine months.

We have a postgres database and are using sunspot on top of solr in a rails app.

I was not here when we decided to use solr so I dont really know why we chose it in the first place. Everything works great for small sets of data but the real pain in reindexing each record after save.

This lets the index get out of date and we end up proccessing these in delayed jobs. This gets us by for now but each time we decide to re-work the index and how it is built, it takes more than 24h on production and causes our client to get mad.

I should note here that we are searching fields on contacts that are 255 chars at max. Most are only 25 chars. There are no pdf files or word files etc.

The end goal is to have a fast search and to have some auto complete searching. I also want us to fuzzy match a search. I want Bill Smith to match BillSmith and some other things.

To do this now I am custom building a part of the index on the contact model. This works but each time my boss adds a requirement like Bill Smith must Match Bill-Smith, I need to rebuild the index.

Is there a better thing to be using here than solr? I wonder if there is for this purpose. I would like in the end to have a search with some of the same features and speed that google has. (Not to that extreme) But I need the index to be rebuilt fast if I need an index at all.

This is for a rails app with about 15M db records in 30 tables.

Any guidance here would be great as we are about to consider dropping solr.

EDIT: Another question is that do you need an index for fast search? Cant postgres use its own index to get things just as fast?

Check out this question: http://stackoverflow.com/questions/1284083/choosing-a-stand-alone-full-text-search-server-sphinx-or-solr — kddeisz, Jan 08 '14 at 14:23

score 3 · Answer 1 · answered Jan 08 '14 at 14:07

3

Postgres would handle that quite well with full text search…

http://www.postgresql.org/docs/current/static/textsearch.html

Note that it allows to use all sorts of dictionaries if you don't like the built-in rules:

http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html

It also has colorful tools, such as trigrams:

http://www.postgresql.org/docs/current/static/pgtrgm.html

answered Jan 08 '14 at 14:07

Denis de Bernardy

75,850
13
131
154

I mean to say at what point is postgres not a good option? When your db hits 500M records? – joncodo Jan 08 '14 at 15:03
2

I am not an postgres expert, but we are using oracle-text currently in one of our products and want to switch to Solr. This is because we want more features than just search. Like: Faceting, Spellchecking, MoreLikeThis, etc. As soon as you get there, I think you need ElasticSearch or Solr or alike. – cheffe Jan 08 '14 at 17:11
@JonathanO: Postgres runs databases whose size is measures in hundreds to TeraBytes. There are a handful whose size is measured in PetaBytes. Skype ran on Postgres with in excess of 10k transactions per second last I heard. Yahoo reported using it to run a 2PB monster in 2008 (with a modified engine, but that was 5 years ago): http://www.computerworld.com/s/article/9087918/Size_matters_Yahoo_claims_2_petabyte_database_is_world_s_biggest_busiest – Denis de Bernardy Jan 08 '14 at 17:54
@Denis Don't know if GiST scales quite so happily up to those sizes, but 500M records shouldn't be unreasonably challenging. – Craig Ringer Jan 09 '14 at 00:28
@CraigRinger: true. (I'd be curious to know how solr scales nowadays, if you happen to have figures handy -- cursory googling yielded nada. Last I looked at it, Lucene and Solr hadn't merged yet, and scaling beyond a few million docs was problematic.) – Denis de Bernardy Jan 09 '14 at 01:14

score 1 · Answer 2 · answered Jan 09 '14 at 00:35

I also want us to fuzzy match a search. I want Bill Smith to match BillSmith and some other things

While PostgreSQL's full-text search can help you with this sort of thing, you may find that you need to supply a custom set of stems / a custom dictionary, or even write a custom tsearch parser depending on the details of your needs.

The basic tsearch isn't all that easily customised for application-specific text handling rules.

each time my boss adds a requirement like Bill Smith must Match Bill-Smith, I need to rebuild the index

You will have that with PostgreSQL full-text search too - and adding such requirements may be trickier.

Fundamentally I think that's an issue you will have with any indexing system. In theory an index could be partially updated in such a case - e.g. removing all entries for Bill, Smith, or BillSmith then adding them back in according to the new rules. I'm not sure any off the shelf system does that, though.

If you want something that performs like Google you'll probably need to throw monsterous computing resources at the task. It's amazing how fast a search can go when it's parallelized over 1000s nodes that have the data of interest cached in RAM.

What database to use for 15M records and a fast search index

2 Answers2