Is this strategy for fast substring search in MySQL fast enough?

Question

I have a USER table with millions of rows. I am implementing a search function that allows someone to look for a user by typing in a username. This autocomplete feature needs to be blazingly fast. Given that, in MySQL, column indexes speed up queries using LIKE {string}%, is the following approach performant enough to return within 200ms? (Note: Memory overhead is not an issue here, username are maximum 30 characters).

Create a USERSEARCH table that has a foreign key to the user table and an indexed ngram username column:

    USERSEARCH
    
    user_id    username_ngram   
    -------------------------
    1          crazyguy23         
    1          razyguy23       
    1          azyguy23      
    1          zyguy23       
    ...

The query would then be:

    SELECT user_id FROM myapp.usersearch WHERE username_ngram LIKE {string}%
    LIMIT 10

I am aware that third party solutions exist, but I would like to stay away from them at the moment for other reasons. Is this approach viable in terms of speed? Am I overestimating the power of indexes if the db would need to check all O(30n) rows where n is the number of users?

Why use `_n` columns for a variable number of pieces of data that should be in rows? — Caius Jard, Jun 20 '20 at 08:11
(In relation to the question before it was edited) Between running 1 query that stops at 10 rows, and running 30 queries that don't necessarily stop early, then concatenating and deduping the results, then limiting them to 10? I'd say so — Caius Jard, Jun 20 '20 at 08:17
(In relation to your edited question) That's better.. now what do your perf tests reveal? We can't guarantee it'll run in <200ms if your server is a pentium 4 with win98 on a CF card as a hard disk — Caius Jard, Jun 20 '20 at 08:18
@CaiusJard haha it is running on Google Cloud. I can't afford to test it until I have proof of concept by the community. If there are 1M users. then the USERSEARCH will have O(30M) rows. Is this too much even for indexes to handle? — Rage, Jun 20 '20 at 08:23
@Strawberry off topic but I agree. I just wanted to simplify the look of the query to make it appear less daunting. However, I do wonder if there is a way to order by similarity DESC — Rage, Jun 20 '20 at 08:37
You can test it on any PC using random data or an importd dictionary. — Paul Spiegel, Jun 20 '20 at 14:06
You're bloating your table with a lot of junk entries. I'm not sure this will provide any speed benefit whatsoever as any gains on look-up speed will be crushed by how big the index gets. — tadman, Jun 21 '20 at 00:50
*Big-O* notation only makes sense when talking about algorithms, not size of data. Just say you have on the order of 100M rows (nearest power of ten is fine, rounding up) to give us an example of the data-set size. — tadman, Jun 21 '20 at 00:51
What about common typos? Transpositions? Etc. Most requests type from the beginning of the word, but a bunch of the rest are cluttered with typos. — Rick James, Jun 21 '20 at 04:25
@RickJames That logic is handled by some logic that repeats the query a dozen times with different scrambled versions of the searched string until 10 arbitrary matches are found. But I may remove this functionality as responsivity is of the essence not so much accuracy. The user just needs to feel well taken care of. — Rage, Jun 21 '20 at 05:26
For reference, the strategy applied here comes from this answer: https://stackoverflow.com/a/22531268/543814 — Timo, Jun 22 '20 at 10:17

Gordon Linoff · Answer 1 · 2020-06-21T11:43:39.810

1

Probably not. The union distinct is going to process each subquery to completion.

If you just want arbitrary rows, phrase this as:

(SELECT user_id
 FROM myapp.usersearch
 WHERE username_1 LIKE {string}%
 LIMIT 10
) UNION DISTINCT
(SELECT user_id
 FROM myapp.usersearch
 WHERE username_2 LIKE {string}%
 LIMIT 10
)
LIMIT 10;

This will at least save you lots of time for common prefixes -- say 'S'.

That said, this just returns an arbitrary list of 10 user_ids when there might be many more.

I don't know if the speed will be fast enough for your application. You have to make that judgement by testing on an appropriate set of data.

edited Jun 21 '20 at 11:43

answered Jun 20 '20 at 10:59

Gordon Linoff

1,242,037
58
646
786

What if the first SELECT finds 10 matches, is there any way to make the next SELECT after the UNION never occur? Perhaps by inserting some sort of variable in the second SELECT (eg. LIMIT 10-count(*) ) – Rage Jun 20 '20 at 18:40
@Rage . . . You *might* be able to do that with CTEs, but that can be tricky. – Gordon Linoff Jun 21 '20 at 00:44
That is two copies of exactly the same `SELECT`. Perhaps you wanted something different? – Rick James Jun 21 '20 at 04:22

score 0 · Answer 2 · answered Jun 22 '20 at 10:16

Assuming SSDs, that should be blazing fast, yes.

Here are some further optimizations:

I would add a DISTINCT to your query, since there is no point in returning the same user_id multiple times. Especially when searching for a very common prefix, such as a single letter.
Also consider searching only for at least 3 letters of input. Less tends to be meaningless (since hopefully your usernames are at least 3 characters long) and is a needless hit on your database.
If you're not adding any more columns (I hope you're not, since this table is meant for blazing fast search!), we can do better. Swap the columns. Make the primary key (username_ngram, user_id). This way, you're searching directly on the primary key. (Note the added benefit of the alphabet ordering of the results! Well... alphabetic on the matching suffixes, that is, not the full usernames.)
Make sure you have an index on user_id, to be able to replace everything for a user if you ever need to change a username. (To do so, just delete all rows for that user_id and insert brand new ones.)
Perhaps we can do even better. Since this is just for fast searching, you could use an isolation level of READ_UNCOMMITTED. That avoids placing any read locks, if I'm not mistaken, and should be even faster. It can read uncommitted data, but so what... Afterwards you'll just query any resulting user_ids in another table and perhaps not find them, if that user was still being created. You haven't lost anything. :)

score -1 · Answer 3 · answered Jun 21 '20 at 02:03

-1

I think you nedd to use mysql full text index to improve performance. You need to change your syntax to use your full text index.

Create full text index:

CREATE FULLTEXT INDEX ix_usersearch_username_ngram ON usersearch(username_ngram);

The official mysql documentation how to use full text index: https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html

answered Jun 21 '20 at 02:03

Francisco Pérez

527
4
9

A full text index would not achieve what I need. It indexes full words. If I had a full text index on username_ngram then the string "guy" would never be found. – Rage Jun 21 '20 at 02:56
By default yes. Read it. If that not work for you it's ok. But i think that it's possible. https://dev.mysql.com/doc/refman/8.0/en/fulltext-search-ngram.html – Francisco Pérez Jun 21 '20 at 03:13

Is this strategy for fast substring search in MySQL fast enough?

3 Answers3