
I have a table with 40+ million rows in MS SQL Server 2019. One of the columns stores pure hexadecimal strings (encoding both binary and readable ASCII content). I need to search this table for rows containing a specific hex string.

Normally, I would do this:

SELECT * FROM transactionoutputs WHERE outhex LIKE '%74657374%' ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;

Since the results are paginated, it can take less than a second to find the first 10 results. However, when increasing the offset, or searching for strings that only appear 1-2 times in the entire table, it can take more than a minute, at which point my application will time out.

The execution plan for this query: (screenshot of the execution plan)

Are there any easy ways to improve the performance of such a search?

Using this answer, I was able to reduce the query time from 33 seconds to 27 seconds:

SELECT * FROM transactionoutputs WHERE
CHARINDEX('74657374' collate Latin1_General_BIN, outhex collate Latin1_General_BIN) > 0
ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;

When I leave out the ORDER BY and pagination, I can reduce this to 19 seconds. This is not ideal because I need both the ordering and pagination, and it still has to scan the entire table.

I have tried the following:

  • Create an index on that column (a sketch follows this list). This has no noticeable effect.
  • I came across this article about slow queries. Initially, I was using parameterized queries in my application, which was much slower than running them in SSMS. I have since moved to the query shown above, but it is still slow.

  • I tried to enable Multiple Active Result Sets (MARS), but without any improvement in query time.
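
For reference, a minimal sketch of the kind of index that was tried (assuming the dbo schema and the index name mentioned later in the answer):

CREATE NONCLUSTERED INDEX [outhex-index]
ON dbo.transactionoutputs (outhex);
-- With a leading wildcard such as LIKE '%74657374%' the optimizer can only
-- scan this index, not seek into it, which is consistent with it having no
-- noticeable effect here.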

I also tried using Full-Text Search. This seemed to be the most promising solution as text search is exactly what I need. I created a full-text index and can do a similar query like above, but using the index:

SELECT * FROM transactionoutputs WHERE CONTAINS(outhex,'7465') ORDER BY id desc OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;

This returns results almost instantly. However, when the search string is longer than a few characters (often more than 4), it doesn't return anything. Am I doing something wrong, or why is that happening?
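
For context, a full-text index like the one described is typically created along these lines; this is a hedged sketch, and the catalog name, key index name, and language are assumptions, not taken from the question:

CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.transactionoutputs (outhex LANGUAGE 0)  -- 0 = neutral word breaker
KEY INDEX PK_transactionoutputs  -- assumed name of the table's unique key index
WITH CHANGE_TRACKING AUTO;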

The execution plan: (screenshot of the full-text search execution plan)

My understanding is that my case is not the ideal use case for FTS, as it is designed to search in readable text and not hexadecimal strings. Is it possible to use it anyway, and if so, how?

After reading dozens of articles and SO posts, I cannot confidently say I know how to improve the performance of such queries for my specific use case, if it is even possible at all. So, is there any easy option to improve this?

Johannes Mols
  • You're using a leading wildcard here, that's the *real* problem. An index can't be used when you're using clauses like `LIKE '%74657374%'` or wrapping a column with `CHARINDEX` in the `WHERE`. *If* the value you're looking for (`'74657374'`) is its own "word" in the column, then a Full Text Index would help you. If, however, it is part of the string (e.g. `'9873A74657374987D'`) and could appear in any part, then it won't help you. If specific parts have meaning in your string, you should be separating them into different columns. Perhaps that's what you *really* need to do. – Thom A Apr 20 '20 at 14:36
  • @Larnu Yes that is exactly my problem. Unfortunately, it is hexadecimal data, more than half of which is binary and cannot be converted to readable text. I was hoping there would be some kind of solution, but it appears that there is not. I could potentially try to convert the rows that are partially convertible and store those in a new column, so I can at least search the readable parts. This wouldn't be a complete solution though. – Johannes Mols Apr 20 '20 at 14:39
  • Such are the issues of using a "generic" storage solution - a one-size-fits-all approach with a single column. Simplifying the design/input side often leads to much more significant issues when trying to use the information once inside the database. – SMor Apr 20 '20 at 14:46
  • @SMor I agree with you. Unfortunately this is the nature of the data that I am working with. I will try to create another column with the converted version and do a full-text search on that, as Larnu suggested. – Johannes Mols Apr 20 '20 at 14:49
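
A hedged sketch of the idea from the last two comments: a second column holding the hex decoded to ASCII so the readable parts can be searched. The column name and sizes are assumptions, and the decode produces unreadable bytes for the purely binary rows:

ALTER TABLE dbo.transactionoutputs ADD outascii VARCHAR(MAX) NULL;

-- Style 2 interprets the string as hex without a 0x prefix,
-- e.g. '74657374' -> 0x74657374 -> 'test'.
-- Assumes outhex always contains valid, even-length hex.
UPDATE dbo.transactionoutputs
SET outascii = CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), outhex, 2));

On 40+ million rows the UPDATE should be run in batches rather than as one statement.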

1 Answer


First, kudos for the fantastic explanation of your problem. This helps you get better answers fast. You should also include DDL, including indexes, when possible. This will become clear as I answer your question.

I'm going to tackle a couple of issues with your query which are unrelated to how you're parsing your text now, and will talk about how to handle the string problem later tonight.

Answer Part 1: Unrelated to string parsing

The way you are searching through the string may not be the only performance problem. Let's start with the SELECT * - do you absolutely need all the columns? Specifically, do you absolutely need all the columns that are included in that Key Lookup? This is the most important thing to sort out. Let me explain.

Your query is performing a scan against your nonclustered index named outhex-index, then performs a Key Lookup to retrieve the columns not included in outhex-index. Key Lookups destroy performance, especially against a clustered index with 40,000,000 rows.

If you do need those columns, then you should consider adding them as included columns to your outhex-index index. I say consider because I don't know how many columns there are nor their data types. Included columns speed up queries by eliminating costly Key Lookups, but they slow down data modification, sometimes dramatically, depending on the number/type of indexes. If you need the columns not included in outhex-index and they are big columns (MAX/BLOB/LOB data types, XML, etc.) then a covering index is NOT an option. If you don't need them, then you should refactor your SELECT * statement to only include the columns you need.
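
A minimal sketch of what such a covering index could look like, assuming the selected columns are small; the INCLUDE list is a placeholder, and DROP_EXISTING rebuilds the existing outhex-index in place:

CREATE NONCLUSTERED INDEX [outhex-index]
ON dbo.transactionoutputs (outhex)
INCLUDE (col1, col2 /* ...the remaining columns the query selects */)
WITH (DROP_EXISTING = ON);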

Full-text indexing is not an option here unless you can find a way to lose that sort. Sorting has N log N complexity, which means the sort gets more expensive the more rows you sort. A 40-million-row sort should be avoided whenever possible. This will be hard to avoid with full-text indexing for reasons which would take more time to explain than I have. Adding/modifying a 40-million-row index can be expensive and take a lot of time. If you do go that route, I suggest taking an offline copy of that table to time how long it takes to build. You can also consider creating a filtered index, if possible, to reduce your search area.
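
If a suitable predicate exists, a filtered index could look like the sketch below; the WHERE clause here is purely hypothetical and only helps if it excludes a large share of the 40 million rows:

CREATE NONCLUSTERED INDEX IX_transactionoutputs_outhex_filtered
ON dbo.transactionoutputs (outhex)
WHERE spent = 0;  -- hypothetical column/predicate, pick one that fits your data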

I noticed, too, that both queries are getting serial execution plans. I don't know if a parallel plan will help the first query with the key lookup but I know that it will likely help with the second one as there is a sort involved. Parallel execution plans can really speed up sorts. Consider testing your query with OPTION (QUERYTRACEON 8649) or make_parallel() by Adam Machanic.
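
For example, the CHARINDEX query from the question with the trace-flag hint appended (QUERYTRACEON 8649 is undocumented and requires sysadmin, so treat this as a test-only sketch):

SELECT * FROM transactionoutputs WHERE
CHARINDEX('74657374' COLLATE Latin1_General_BIN, outhex COLLATE Latin1_General_BIN) > 0
ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
OPTION (QUERYTRACEON 8649);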

I'll update this post tonight with some ideas to parse your string faster. One thing you could look into in the meantime is Paul White's clever Trigram Wildcard String Search trick which might be an option.
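
To give a rough idea of the trigram approach (a greatly simplified sketch, not Paul White's actual implementation; table and column names are assumptions): every 3-character substring of outhex is stored in a side table, one of the pattern's trigrams narrows the candidate ids, and a residual LIKE verifies the full match.

CREATE TABLE dbo.outhex_trigrams
(
    trigram CHAR(3) NOT NULL,
    id      BIGINT  NOT NULL,  -- assumed to match transactionoutputs.id
    CONSTRAINT PK_outhex_trigrams PRIMARY KEY CLUSTERED (trigram, id)
);

DECLARE @pattern VARCHAR(100) = '74657374';

SELECT TOP (10) t.*
FROM dbo.transactionoutputs AS t
WHERE t.id IN
(
    SELECT g.id
    FROM dbo.outhex_trigrams AS g
    WHERE g.trigram = LEFT(@pattern, 3)  -- the real technique picks the most selective trigram
)
AND t.outhex LIKE '%' + @pattern + '%'  -- residual check for the full pattern
ORDER BY t.id DESC;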

Alan Burstein
  • Thank you for the elaborate answer. Yes, I do need all the columns of the results. It is 10 columns with either integers or short strings, so this should hopefully not be a problem, but I refactored my statement to leave out the ID at least. Are you certain that the Key Lookup is the problem? The estimated execution plan says the index scan is the expensive operation, as that has to go through all 40 million rows. I ended up creating a new column with the hexadecimal strings converted to ASCII where possible, and performing a LIKE '%%' search on that, which performs much better for some reason. – Johannes Mols Apr 21 '20 at 10:52
  • Also, how can you see that the query is performed serially? I will try to add the option you mentioned, thank you for your help. – Johannes Mols Apr 21 '20 at 10:53