
I'm currently using a UTF-8 MySQL database. My PHP script checks whether a translation is already in the database; if not, it performs the translation and stores the result:

SELECT * FROM `translations` WHERE `input_text`=? AND `input_lang`=? AND `output_lang`=?;

(The other field is "output_text".) In a naive database, the lookup would compare the input text with the "input_text" TEXT field character by character. As long as the characters match, it would keep comparing; once they stop matching, it would move on to the next row.

I don't know how databases work at a low level, but I would assume that such a naive database would examine at least one character of every row before concluding that the input text isn't in the database.

Ideally the input text would be converted to a hash (e.g. using SHA-1), and each "input_text" would also be stored as a hash. Then, if the database is sorted properly, it could rapidly find all of the rows that match the hash and then check the actual text. If no hashes match, it would return no results without each row having been checked individually.

Is there a type of MySQL storage engine that can do something like this, or is there some additional PHP that can optimize things? Should "input_text" be set to some kind of index? (PRIMARY/UNIQUE/INDEX/FULLTEXT)
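To make the hash idea concrete, here is a sketch of one way to do it in plain MySQL without a special storage engine. The column and index names (`input_hash`, `idx_input_hash`) are hypothetical, not part of my current schema:

```sql
-- Hypothetical sketch: store a SHA-1 of the input text and index it.
ALTER TABLE `translations`
  ADD COLUMN `input_hash` CHAR(40) NOT NULL,
  ADD INDEX `idx_input_hash` (`input_hash`);

-- Backfill existing rows (MySQL's SHA1() returns a 40-char hex string):
UPDATE `translations` SET `input_hash` = SHA1(`input_text`);

-- Lookup: narrow by hash first, then verify the full text to rule out
-- (extremely unlikely) hash collisions.
SELECT * FROM `translations`
WHERE `input_hash` = SHA1(?)
  AND `input_text` = ? AND `input_lang` = ? AND `output_lang` = ?;
```

The index on the fixed-width CHAR(40) column does the fast narrowing; the `input_text = ?` comparison only runs on the handful of rows sharing that hash.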

Is there an alternative type of database, compatible with PHP, that is far superior to MySQL?

edit: This talks about B-Tree vs Hash indexes for MySQL:

http://dev.mysql.com/doc/refman/5.5/en/index-btree-hash.html

None of the limitations of hash indexes are a problem for me. It also says:

They are used only for equality comparisons that use the = or <=> operators (but are very fast)

["very" was italicized by them]

NEW QUESTION:

How do I set up the "input_text" TEXT column to use a hash index? BTW, multiple rows contain the same "input_text"... is that alright for a hash index?

http://dev.mysql.com/doc/refman/5.5/en/column-indexes.html

It says "The MEMORY storage engine uses HASH indexes by default" - does that mean I just have to change the storage engine and set the column index to INDEX?
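For reference, a MEMORY-engine version might look like the sketch below (table and index names are hypothetical). Two caveats I've found: the MEMORY engine does not allow TEXT/BLOB columns, so "input_text" would have to become a VARCHAR, and MEMORY tables are emptied on server restart, so this could only work as a cache in front of the real table:

```sql
-- Hypothetical MEMORY-engine cache table with an explicit HASH index.
-- MEMORY does not support TEXT/BLOB, so VARCHAR is used instead.
CREATE TABLE `translations_cache` (
  `input_text`  VARCHAR(255) NOT NULL,
  `input_lang`  VARCHAR(10)  NOT NULL,
  `output_lang` VARCHAR(10)  NOT NULL,
  `output_text` VARCHAR(255) NOT NULL,
  INDEX `lookup_idx` USING HASH (`input_text`, `input_lang`, `output_lang`)
) ENGINE=MEMORY;
```

A non-unique HASH index is fine with duplicate "input_text" values; duplicates simply share a hash bucket.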

Luke Wenke
  • You may be interested in this http://stackoverflow.com/questions/9820801/how-does-mysql-fulltext-search-work – rkosegi Apr 05 '13 at 08:17
  • BTW I'm using phpMyAdmin. Also I'm not searching for individual words within a sentence. I'm searching whether an input string exactly matches one from in the database. (if there is a match it then checks for matching input and output languages) – Luke Wenke Apr 05 '13 at 08:23
  • Have you considered adding a `LIMIT 1` at the end? I've found that to help in my own usage, as it will speed up the query when the entire goal is to get a single match. – Moshe Apr 05 '13 at 08:42

2 Answers


A normal INDEX clause should be enough (be sure to index all the fields you filter on; the index will be big on disk, but lookups will be faster). FULLTEXT indexes are for word searches with MATCH ... AGAINST, not exact-match lookups ;-)

Anyway, for this kind of lookup you could use a NoSQL store like Redis: it's blazingly fast, keeps its data in memory, and provides persistence through snapshots.

There is a PHP extension here: https://github.com/nicolasff/phpredis

You'd then use Redis keys in the form YOUR_PROJECT:INPUT_LANG:WORD:OUTPUT_LANG for better data management; just replace each part with your own values and you're good to go ;)
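Under that scheme, a stored entry might look like this (project name and words purely illustrative):

```
SET translator:en:hello:fr "bonjour"
GET translator:en:hello:fr
```

Redis key lookups are O(1) hash-table operations, which is exactly the behavior you were hoping for from a hash index.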

Mathieu Amiot

An index will speed up the lookups a lot.

By default, indexes in InnoDB and MyISAM use search trees (B-trees). There is a limit on the length of an index key, so you will have to index only the first ~700 bytes of the text.

CREATE INDEX txt_lookup ON translations (input_lang, output_lang, input_text(255));

This will create an index on input_lang, output_lang and the first 255 characters of input_text.

When you run your example query, MySQL will use the index to quickly find the rows with the matching languages and the same first 255 characters, and then do the slower full-length string comparison only on the small set of rows produced by the index.

Vatev
  • I might be misunderstanding but do you mean that "input_text" would still be a TEXT with no size limit? – Luke Wenke Apr 05 '13 at 09:26
  • It will have a size limit (every datatype has one), but it can be much higher (4GB or something...). The index uses only the first xx chars of the field value. – Vatev Apr 05 '13 at 12:29
  • You should also be careful with the collation on that table. If you use a case insensitive collation (which is the default one sometimes) "Foo bar" will get the same translation as "foo bar". – Vatev Apr 05 '13 at 12:32