Optimising LIKE expressions that start with wildcards

Question

I have a table in a SQL Server database with an address field (ex. 1 Farnham Road, Guildford, Surrey, GU2XFF) which I want to search with a wildcard before and after the search string.

SELECT *
FROM Table
WHERE Address_Field LIKE '%nham%'

I have around 2 million records in this table and I'm finding that queries take anywhere from 5-10s, which isn't ideal. I believe this is because of the preceding wildcard.

I think I'm right in saying that any indexes won't be used for seek operations because of the preceeding wildcard.

Using full text searching and CONTAINS isn't possible because I want to search for the latter parts of words (I know that you could replace the search string for Guil* in the below query and this would return results). Certainly running the following returns no results

SELECT *
FROM Table
WHERE CONTAINS(Address_Field, '"nham"')

Is there any way to optimise queries with preceding wildcards?

As far as I know, SQL Server does not have a built-in index type that does what you want (Postgres does). There are ways of doing what you want, but they require a lot of work. — Gordon Linoff, Jan 26 '17 at 17:11
You're right in that indexes wouldn't be able to seek on that, but even an index scan with key lookups may be faster than a table scan if that's what it is doing now. You'd have to test it and see. — SqlZim, Jan 26 '17 at 17:16
A point of clarification: You can search for parts of words with fulltext indexes, but it must be the first part of the word. Using your string example, you could successfully leverage a fulltext index by searching for 'Guil', but you are correct in saying it will not work for 'nham'. — dfundako, Jan 26 '17 at 17:22
As a little point of technicality, if the full text index contained the string value both forwards and backwards, and you did a `contains` on both `'nham'` and `'mahn'` it would work. If you were looking for `'ldfo'` though you're out of luck. — iamdave, Jan 26 '17 at 17:25
SQL Server might still be able to use an index, only that it will use _index scan_ instead of index seek. — Salman A, Jan 27 '17 at 09:56

score 4 · Accepted Answer · answered Jan 26 '17 at 17:34

Here is one (not really recommended) solution.

Create a table AddressSubstrings. This table would have multiple rows per address and the primary key of table.

When you insert an address into table, insert substrings starting from each position. So, if you want to insert 'abcd', then you would insert:

abcd
bcd
cd
d

along with the unique id of the row in Table. (This can all be done using a trigger.)

Create an index on AddressSubstrings(AddressSubstring).

Then you can phrase your query as:

SELECT *
FROM Table t JOIN
     AddressSubstrings ads
     ON t.table_id = ads.table_id
WHERE ads.AddressSubstring LIKE 'nham%';

Now there will be a matching row starting with nham. So, like should make use of an index (and a full text index also works).

If you are interesting in the right way to handle this problem, a reasonable place to start is the Postgres documentation. This uses a method similar to the above, but using n-grams. The only problem with n-grams for your particular problem is that they require re-writing the comparison as well as changing the storing.

score 3 · Answer 2 · answered Jan 26 '17 at 18:07

I can't offer a complete solution to this difficult problem.

But if you're looking to create a suffix search capability, in which, for example, you'd be able to find the row containing HWilson with ilson and the row containing ABC123000654 with 654, here's a suggestion.

  WHERE REVERSE(textcolumn) LIKE REVERSE('ilson') + '%'

Of course this isn't sargable the way I wrote it here. But many modern DBMSs, including recent versions of SQL server, allow the definition, and indexing, of computed or virtual columns.

I've deployed this technique, to the delight of end users, in a health-care system with lots of record IDs like ABC123000654.

score 1 · Answer 3 · edited May 23 '17 at 11:53

Not without a serious preparation effort, hwilson1.

At the risk of repeating the obvious - any search path optimisation - leading to the decision whether an index is used, or which type of join operator to use, etc. (independently of which DBMS we're talking about) - works on equality (equal to) or range checking (greater-than and less-than).

With leading wildcards, you're out of luck.

The workaround is a serious preparation effort, as stated up front:

It would boil down to Vertica's text search feature, where that problem is solved. See here:

https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Tables/TextSearch/UsingTextSearch.htm

For any other database platform, including MS SQL, you'll have to do that manually.

In a nutshell: It relies on a primary key or unique identifier of the table whose text search you want to optimise.

You create an auxiliary table, whose primary key is the primary key of your base table, plus a sequence number, and a VARCHAR column that will contain a series of substrings of the base table's string you initially searched using wildcards. In an over-simplified way:

If your input table (just showing the columns that matter) is this:

id    |the_search_col                           |other_col
    42|The Restaurant at the End of the Universe|Arthur Dent
    43|The Hitch-Hiker's Guide to the Galaxy    |Ford Prefect

Your auxiliary search table could contain:

id   |seq|search_token
   42|  1|Restaurant
   42|  2|End
   42|  3|Universe
   43|  1|Hitch-Hiker
   43|  2|Guide
   43|  3|Galaxy

Normally, you suppress typical "fillers" like articles and prepositions and apostrophe-s , and split into tokens separated by punctuation and white space. For your '%nham%' example, however, you'd probably need to talk to a linguist who has specialised in English morphology to find splitting token candidates .... :-]

You could start by the same technique that I use when I un-pivot a horizontal series of measures without the PIVOT clause, like here:

Pivot sql convert rows to columns

Then, use a combination of, probably nested, CHARINDEX() and SUBSTRING() using the index you get from the CROSS JOIN with a series of index integers as described in my post suggested above, and use that very index as the sequence for the auxiliary search table.

Lay an index on search_token and you'll have a very fast access path to a big table.

Not a stroll in the park, I agree, but promising ...

Happy playing -

Marco the Sane

Optimising LIKE expressions that start with wildcards

3 Answers3

Linked