
How can queries like

SELECT * FROM sometable WHERE somefield LIKE '%value%'

be optimized?

The main issue here is the leading wildcard, which prevents the DBMS from using an index.

Edit: What is more, the somefield value is a single solid string (not a piece of text), so full-text search cannot be used.

Jonas
  • If you need to find a substring in the value, it's possible your database setup could be tweaked to give you a better option. Can you please provide examples of the data/queries you are actually using, with some context? – MindStalker Jan 17 '10 at 19:33

4 Answers


How long are your strings?

If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:

  • For each word that you want to store in the table, instead take every possible suffix of that word. In other words, you keep stripping the first character until nothing is left. For example, the word value gives:
    • value
    • alue
    • lue
    • ue
    • e
  • Store each of these suffixes in the database.
  • You can now search for substrings using LIKE 'alu%' (which will find 'alu' as part of 'value').

By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.
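As a concrete illustration, here is a minimal MySQL-flavoured sketch of this approach; the table and column names (word_suffixes, word_id, suffix) are made up for the example:

```sql
-- Hypothetical suffix table: one row per suffix of each stored word.
CREATE TABLE word_suffixes (
    word_id INT UNSIGNED NOT NULL,  -- refers to the original word's row
    suffix  VARCHAR(64)  NOT NULL,  -- e.g. 'value', 'alue', 'lue', 'ue', 'e'
    INDEX idx_suffix (suffix)       -- B-tree index, usable for prefix LIKE
);

-- Storing the word 'value' (say, word_id 1) means inserting every suffix:
INSERT INTO word_suffixes (word_id, suffix) VALUES
    (1, 'value'), (1, 'alue'), (1, 'lue'), (1, 'ue'), (1, 'e');

-- A substring search for 'alu' becomes a prefix search over the suffixes,
-- which the index can serve without a full scan. DISTINCT collapses the
-- case where several suffixes of the same word match.
SELECT DISTINCT word_id
FROM word_suffixes
WHERE suffix LIKE 'alu%';
```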

Storage Cost

The number of characters required to store a word becomes word_len*(word_len+1)/2, i.e. quadratic in the word length, on a per-word basis. Equivalently, the factor of increase is (word_len+1)/2. Here it is for various word sizes:

  • 3-letter word: (3*4/2) / 3 = 2
  • 5-letter word: (5*6/2) / 5 = 3
  • 7-letter word: (7*8/2) / 7 = 4
  • 12-letter word: (12*13/2) / 12 = 6.5

The number of rows required to store a word increases from 1 to word_len. Be mindful of this overhead. Additional columns should be kept to a minimum to avoid storing large amounts of redundant data. For instance, a page number on which the word was originally found should be fine (think unsigned smallint), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than for each suffix.
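Continuing the hypothetical schema above, that split might look like this: per-word metadata lives once in a words table, while the suffix table stays narrow:

```sql
-- Per-word metadata, stored exactly once per word:
CREATE TABLE words (
    word_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    word    VARCHAR(64)       NOT NULL,
    page_no SMALLINT UNSIGNED NOT NULL  -- e.g. the page the word was found on
);
-- word_suffixes (see above) then carries only (word_id, suffix), so the
-- quadratic blow-up applies to narrow rows only.
```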

Considerations

There is a trade-off in where we split 'words' (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter as one word or two?

The trade-off is as follows:

  • Anything that is broken up cannot be found as a single element. If we store five and letter separately, searching for five-letter or fiveletter will fail.
  • Anything that is not broken up will take more storage space. Remember, the storage requirement increases quadratically in the word length.

For convenience, you might want to remove the hyphen and store fiveletter. The word can now be found by searching five, letter, and fiveletter. (If you strip hyphens from any search query as well, users can still successfully find five-letter.)
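A sketch of that normalization, using the standard REPLACE string function (word_suffixes is the hypothetical table from the sketch above):

```sql
-- Strip hyphens before generating and storing suffixes:
SELECT REPLACE('five-letter', '-', '');  -- returns 'fiveletter'

-- Strip hyphens from the search term too, so a user searching for
-- 'five-letter' still finds the stored 'fiveletter':
SELECT DISTINCT word_id
FROM word_suffixes
WHERE suffix LIKE CONCAT(REPLACE('five-letter', '-', ''), '%');
```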

Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.

Timo
  • This is a **very** good answer and it's the only answer that solves the problem. (Admittedly it has the limitation that your strings must be short enough that you don't mind multiplying the number of rows by the average string length, but that is probably unavoidable.) – antinome Dec 11 '14 at 15:14
  • It is 2020 and I am considering using your solution. Do you have any updates? In addition, would an index have to be created for each string-truncation column? If so, what do the queries look like? SELECT * FROM user_table WHERE username_1 LIKE %string% OR username_2 LIKE %string% OR username_3 LIKE %string% ......? – Rage Jun 19 '20 at 08:42
  • @Rage Answered in chat: https://chat.stackoverflow.com/rooms/216259/room-for-timo-and-rage. – Timo Jun 19 '20 at 09:12
See also how ElasticSearch [solves this](https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field) (scroll down to "how it works") by storing a bunch of 3-character n-grams for each string. It is a comparable approach, but with the distinct advantage of a storage cost that is **linear** rather than quadratic in the word size. It comes at the cost of more complex lookup logic and, technically, reduced guarantees on lookup performance (due to potential false positives). – Timo May 18 '21 at 14:41

Use Full-Text Search. The "Initial Idea" heading has the same example and leads to a worked example solution.

And the MySQL docs
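For illustration, a minimal MySQL full-text setup on the question's table might look like this (a sketch; it assumes MyISAM, or InnoDB on MySQL 5.6+):

```sql
-- Add a full-text index on the searched column:
ALTER TABLE sometable ADD FULLTEXT INDEX ft_somefield (somefield);

-- Search by word with MATCH ... AGAINST instead of LIKE:
SELECT * FROM sometable
WHERE MATCH(somefield) AGAINST('value');

-- Note: this matches whole words, not arbitrary substrings, which is
-- the limitation raised in the question's edit.
```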

Edit: It can't be tuned in SQL itself. Using functions like LOCATE or PATINDEX won't help either.

gbn
  • Actually I don't need to find some specific word in a text. I need to find a substring in the value (I'll update the question to clarify this). – Jonas Jan 17 '10 at 18:06
  • It doesn't matter whether it's a whole word or not: you can *not* optimise this query. – gbn Jan 17 '10 at 18:08
  • Maybe there are more complex solutions, beyond just optimizing the query, that perform this type of search faster. – Jonas Jan 17 '10 at 18:15
Full-text search is not usable for very short words or for stop words. – luky Sep 30 '22 at 09:20

Two ways:

(1) Use an in-memory table, so the unavoidable scan at least runs in RAM and goes very fast.

(2) Cook up a better index and search algorithm than foo LIKE '%bar%'. It's not possible to make any suggestions about this without knowing more about your problem.

As you have pointed out, the %bar% pattern guarantees a table-scan for every lookup, which nullifies any possible search ingenuity in the database software.
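For option (1), a minimal sketch using MySQL's MEMORY engine (the id column and the _mem table name are assumptions for the example):

```sql
-- Keep a RAM-resident copy of the searched data; the query still
-- scans every row, but entirely in memory.
CREATE TABLE sometable_mem (
    id        INT UNSIGNED NOT NULL PRIMARY KEY,
    somefield VARCHAR(255) NOT NULL
) ENGINE=MEMORY;

INSERT INTO sometable_mem (id, somefield)
SELECT id, somefield FROM sometable;

-- Still a full scan, but a fast one:
SELECT id FROM sometable_mem WHERE somefield LIKE '%value%';
```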

O. Jones

It won't make a huge difference, given that your problem is the wildcard, but avoiding "SELECT *" will improve query performance. If you're not actually using all the fields you get back, returning fewer columns is a win in itself; in addition, "SELECT *" causes two queries to fire: one to look up the fields for the table, and then your query with the field names filled in.
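For example, assuming the caller only needs an id and the matched field (the column names are placeholders):

```sql
-- Name the columns you actually use instead of SELECT *:
SELECT id, somefield
FROM sometable
WHERE somefield LIKE '%value%';
```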

Tom