8

I have a large dataset with 5M rows. One of the fields in the dataset is 'article_title', which I'd like to search in real-time for an autocomplete feature that I'm building on my site.

I've been experimenting with MySQL and MongoDB as potential DB solutions. Both perform well when an index is used, for example for 'something%', but I need to match titles within a string, as in '%something%'.

Both MySQL and MongoDB took 0.01 seconds with an index using forward-looking search, and about 6 seconds with a full string search.

I realize that the entire DB needs to be scanned for a string-in-string type search so what is the common approach to this problem? Solr and Sphinx seem like overkill for this one problem so I'm trying to avoid using them if possible.

If I got a box with 2 GB of RAM and a 40GB SSD (which is what I can afford at the moment), would I be able to get sub-second response time? Thanks in advance.

--

UPDATE: I tried a fulltext index and while the results are very fast, it doesn't really satisfy a string-in-string search ("presiden" doesn't match "president"). I'm looking for ways to match a string-in-string with a 5M row dataset.

soulkphp
  • 3,753
  • 2
  • 17
  • 14
  • Give more info about MySQL (version, engine, structre, query you used) and MongoDB (cfg, version, client) – kwarunek Aug 10 '13 at 22:29
  • MySQL 5.1.7, Mongod 2.4.5. The MySQL table is MyISAM since it's exclusively reads and i'm only looking for performance. – soulkphp Aug 10 '13 at 22:50
  • See also http://stackoverflow.com/questions/17973889/what-is-the-best-optimization-technique-for-a-wildcard-search-through-100-000-re/18025870#18025870 should be ok for Titles, but would not work for body-content – rlb Aug 11 '13 at 00:19

2 Answers2

6

In the case of MySQL, you can create a full-text index. To put it simply, a full-text index makes partial text matches fast by indexing each word. To create an index you would write:

alter table YourTable add fulltext index(article_title);

After that you can search with:

select * from YourTable where match(article_title) against ('something');

It seems that MongoDB also has text indexes. I imagine the indexing can be fine-tuned in either case, so you'll have to test which is better for your case.

Joni
  • 108,737
  • 14
  • 143
  • 193
  • Good to know... I thought fulltext indexes give you better results but didn't know they speed up partial text matching. I'll do that now and follow up with the results. Thanks! – soulkphp Aug 10 '13 at 22:47
  • 3
    With the limitation, as far as I know, that full-text index will index *words*: whereas the pattern `%bar%` will match `foobar`, `barfoo` and `foobarfoo`, searching for the word `bar` in full-text index will not necessary find the words containing that *substring*. Or is my knowledge on that topic outdated? – Sylvain Leroux Aug 10 '13 at 22:55
  • @SylvainLeroux I believe you are right. Nevertheless, that limitation seems acceptable for the purpose described by OP. – GolezTrol Aug 10 '13 at 23:16
  • Also read the http://stackoverflow.com/questions/609935/mysql-full-text-indexing-limitations – kwarunek Aug 10 '13 at 23:37
  • The fulltext index helped as i'm getting sub-second response times now however as Sylvain mentioned, the results aren't quite string-in-string searching. The results aren't terrible quality but 'presiden' doesn't return 'president'. That makes for a pretty crummy autocomplete feature. Any other ideas? – soulkphp Aug 11 '13 at 07:02
  • You could use an external search engine. There are several: lucene, sphinx, ... Or create a basic engine yourself by extracting words into a separate table, with a normal search index which allows searching by prefix – Joni Aug 11 '13 at 09:17
  • @Joni Any ideas on how to implement a fast '%phrase%' search in MongoDB without scanning the entire collection? Assuming the article titles are only 200 characters long (at most), what's an easy way to find string-in-string matching? – soulkphp Aug 11 '13 at 22:49
  • I'm not really a mongodb expert, sorry. – Joni Aug 12 '13 at 07:09
3

When using a regular index, which is typically implemented as a BTREE, the index works from left-to-right. So a query like something% will work because the left-side of the index can be used. With a query like %something or %something% such an index cannot be used.

A Full-Text index is different in that it indexes uncommon words. Common words (stop-words), like the for example, are excluded. MySQL full-text index also leaves out words that are 3 characters or smaller.

For small cases the built-in Full-Text index will work just fine. The built-in full text indexes usually only take you so far though, so at some point you may need to use a dedicated solution, like Elastic Search or Spynx.

Luke
  • 13,678
  • 7
  • 45
  • 79