
I am increasingly aware of the importance of making my queries efficient. It is crucial that I have the proper indexes, etc., to make sure that my queries don't consume more I/O than is really necessary. But here is a query that is just ugly and I don't know how to make it efficient.

Let's say I have a table for stock items that looks something like this in its most basic form:

CREATE TABLE StockItems (
  ItemID INT IDENTITY (1, 1) NOT NULL PRIMARY KEY,
  SerialNo VARCHAR (50) NOT NULL
);

Now I need to write a stored proc that will return all stock items for which the serial number matches a partial serial number provided. The following query would achieve exactly that:

SELECT * FROM StockItems WHERE SerialNo LIKE '%' + @SearchStr + '%'

But when I see a query like that I break out in cold sweat and clutch my comfort blanket tighter. There's no index on SerialNo and even if there were, it would be of no help considering that I'm searching for a partial match.

Is there a way to achieve this task with queries that are more efficient?

Dewald Swanepoel
  • I'm afraid there's no other way of doing this (that would be more efficient) if you want it to be a partial match. – NickyvV May 20 '15 at 09:37
  • This might help: http://stackoverflow.com/questions/7510646/like-vs-contains-on-sql-server – Tanner May 20 '15 at 09:48
  • Thanks @NickyvV, I was afraid that'd be the answer. But I'm still perplexed. Surely this is not an entirely unusual requirement. Now I'm sure the big boys like Amazon have much more complicated servers with distributed storage and workload, but how do they do it? If I search for a partial book title, do they also have to scan through their entire stock pool? – Dewald Swanepoel May 20 '15 at 09:48
  • There are many ways to optimize such a query, but not really all that many that involve just tucking some kind of built-in index type onto the table. You could, for instance, break a book title out into all its words, remove noise words (the, and, or, etc.), store those words in a separate table, and link them to all book titles containing them. Then you would first query against those rows, and then use the LIKE expression to get a correct partial match. – Lasse V. Karlsen May 20 '15 at 10:03
  • Those scenarios are certainly valid; if you are able to implement a `Full-text Index` you could improve query performance: https://msdn.microsoft.com/en-us/library/ms142571.aspx?f=255&MSPPError=-2147217396 and https://technet.microsoft.com/en-us/library/ms187787(v=sql.105).aspx – NickyvV May 20 '15 at 10:04
  • Amazon and folks usually use full-text indices; those are not the same as searching for an exact partial match (e.g. you can find "tree" in "caring for trees", but not "tre" or "car"). You can add your own partitioning if you can think of any, but `like '%XXX%'` is pretty much impossible to index in any meaningful way. An index could still help the I/O: instead of going through all the data in the table, the engine would only have to scan through the index, which is worthwhile if the index is much smaller than the table. But that's about it. `%XX` and `XX%` are both fine, but `%XX%` is crazy. – Luaan May 20 '15 at 11:19
  • And the truth is, people usually want either `%XX` or `XX%` (or both, which is still a far cry from `%XX%`). If there's a reason to search for something in the middle, it usually means that your column contains multiple values and should be split into multiple columns. Most people will be entirely happy searching by `XX%` for things like serial numbers: you just don't want to type out the whole number by hand, but you do actually have it. If you have prefixes or something like that, that's exactly the kind of thing you can put into a separate column. – Luaan May 20 '15 at 11:23
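
As the comments point out, an ordinary index on SerialNo does help for prefix searches, just not when the wildcard comes first. A minimal sketch (the index name is illustrative):

```sql
-- Plain index on the serial number column (name is illustrative)
CREATE INDEX IX_StockItems_SerialNo ON StockItems (SerialNo);

-- Prefix search: the optimizer can seek on the index
SELECT ItemID, SerialNo
FROM StockItems
WHERE SerialNo LIKE @SearchStr + '%';

-- Leading wildcard: the best the engine can do is scan the (smaller) index
SELECT ItemID, SerialNo
FROM StockItems
WHERE SerialNo LIKE '%' + @SearchStr + '%';
```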

1 Answer


The problem of searching for words in text is quite different from the problem of searching quickly for substrings in a serial number. Inverted indexes, for instance, are an integral part of text analysis, but it is less obvious to see how these apply to your problem.

Your question is how this is done in practice. One simple answer is hardware. In a database environment, this means pinning the table containing the serial numbers into memory (which largely happens by default if the table is used a lot and is not too big) and then doing the scan. The scan goes faster with more threads/processors, so throwing hardware at the problem helps.

Writing custom code (which would be the case at an Amazon or Google) would make the process even faster.

There are other solutions. If the strings are largish, then you can use n-grams to encode them. N-grams (in this case) are sequences of characters, say three long, that appear in the string.

For each serial number you break out all sequences of n-grams and store these in a big, indexed table. Then you look for all the n-grams in the search string. First, get only the serial numbers that have the same set of n-grams. Then do the brute force search on that.
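
A sketch of such a side table, assuming trigrams (all names are illustrative; it would be kept in sync by whatever code inserts the serial numbers, e.g. a trigger or the application):

```sql
-- One row per (trigram, item) pair; the composite key doubles as the lookup index
CREATE TABLE StockItemTrigrams (
  Trigram CHAR (3) NOT NULL,
  ItemID  INT      NOT NULL REFERENCES StockItems (ItemID),
  PRIMARY KEY (Trigram, ItemID)
);
```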

For instance, if you had the serial number '1234567890', then it has the following trigrams: 123, 234, 345, 456, 567, 678, 789, 890

If you are searching for %1919%, then any matching string needs to contain both 191 and 919. The serial number above contains neither, so you don't need to search it.
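
Putting the two steps together, a sketch of the lookup (assuming a side table StockItemTrigrams with columns Trigram and ItemID, populated as described above; the application would split the search string into its trigrams before building the query):

```sql
-- Sketch: pre-filter via the trigram table, then verify with LIKE
DECLARE @SearchStr VARCHAR (50) = '1919';

SELECT s.ItemID, s.SerialNo
FROM StockItems AS s
WHERE s.ItemID IN (
        SELECT t.ItemID
        FROM StockItemTrigrams AS t
        WHERE t.Trigram IN ('191', '919')     -- trigrams of '1919'
        GROUP BY t.ItemID
        HAVING COUNT(*) = 2                   -- item must contain all of them
      )
  AND s.SerialNo LIKE '%' + @SearchStr + '%'; -- brute force only on survivors
```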

Gordon Linoff