2

I use InnoDB on RDS, which unfortunately does not yet support MySQL full text search. I'm therefore looking into alternatives. My app is on Heroku and I have considered the various addons that provide search capabilities, but have a very large table of companies (~100M records) and I think that they are prohibitively expensive. I only need to be able to search one field on the table -- company name.

I am therefore considering creating my own 'keyword' table. Essentially this would list every word contained in every company name. There would then be another table that shows association between these keywords and the company_id.

  1. Does this sound like a good idea? Are there any better alternatives?

  2. What would be the most efficient way of creating the keyword table and the association table? I'd like to do it using T-SQL, if possible.

alpheus
  • 979
  • 6
  • 15
  • 34

1 Answers1

2

You can do it, and it's far better than using LIKE '%word%' queries.

But it's not nearly as good as using proper fulltext indexing.

See my presentation Full Text Search Throwdown, where I compare the fulltext solutions for MySQL, including trigrams, which is approximately like the keyword solution you're considering.

The fastest solution -- by far -- was Sphinx Search.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • 1
    Agreed. Any external full-text search engine will be faster and much more powerful than MySQL-based search. You can download Sphinx here: http://sphinxsearch.com/downloads/release/ – vfedorkov Dec 24 '12 at 06:26
  • Doesn't it depend on the indexes your realize in MySQL? e.g. if you only create a "word table" this won't be really intelligent. But if you extend it by using damerau-levenshtein http://stackoverflow.com/q/14232851/318765 it will find similar words, too. In the next step you could add a second table with "word groups" ("The dog plays" results "the dog", "dog plays"). So you are able to find the best matching result based on full sentences as well. This wil result in something like BM25 https://en.wikipedia.org/wiki/Okapi_BM25 A question of concept. – mgutt Feb 26 '15 at 16:31
  • @mgutt, if you try to use Damerau-Levenshtein distance calculation, there's no way to index that; it will be forced to calculate the distance for every row in your table. You're welcome to try it out, but I expect it will be expensive to do that search, on the same order of magnitude as the `LIKE '%word%'` search. – Bill Karwin Feb 26 '15 at 17:17
  • You can use indexes. Again it depends on the concept: http://stackoverflow.com/a/14261807/318765 Read my additional comments there. You will find two ideas to use indexes (word length and word chars based on integers). – mgutt Feb 26 '15 at 19:48
  • @mgutt, I maintain there is no way to use an index for the Damerau-Levenshtein distance calculation. What you suggested was a way to limit the subset of rows against which you must calculate that string distance function, but that doesn't make the string distance itself use an index. – Bill Karwin Feb 27 '15 at 02:51