
I'm trying to create a Scrabble-playing bot. So I threw all (2.7 million) Polish words into a SQL Server database, and now I'm working on creating patterns for regex queries. I have just found out that adding some conditions to the WHERE clause can make a search much more efficient.

As an example, executing the query:

SELECT * FROM words WHERE dbo.[like](word, '^[def]{1,3}$') = 1;

takes about 43 seconds, but adding a quite obvious length condition:

SELECT * 
FROM words 
WHERE dbo.[like](word, '^[def]{1,3}$') = 1 AND LEN(word) <= 3;

reduces the execution time to 3 seconds... could you tell me why, and suggest some techniques that would help me make my queries more efficient?

P.S. The like function is a CLR function written in C#:

using System.Text.RegularExpressions;

public static bool Like(string text, string pattern)
{
    // True only when the pattern matches a non-empty portion of the text.
    Match match = Regex.Match(text, pattern);

    return (match.Value != String.Empty);
}
  • Try a **FULL TEXT** index on your table: http://msdn.microsoft.com/en-ca/library/ms187317(v=sql.90).aspx – Pieter Geerkens Aug 24 '14 at 18:28
  • If you have 2.7 million rows, SQL Server's query optimizer cannot possibly know which ones to check and which ones aren't going to match your regex. However, if you add `AND LEN(word) <= 3`, then the query optimizer can exclude all words of 4 characters or more, and thus apply the call to the regex checker to a **much smaller** set of words - that's why the query executes much faster – marc_s Aug 24 '14 at 18:34
  • One simple thing is to change the CLR function to use `Regex.IsMatch(text, pattern)`. However, this problem is probably better solved by holding all the words in memory and using your client language. In C#, you would then be able to use the `Regex(pattern)` constructor to speed things up. You could also look at specialist data structures like tries (see the sketch after these comments) – Laurence Aug 24 '14 at 18:56
  • @PieterGeerkens I don't think a CLR function is going to make any use of a full text index. – Laurence Aug 24 '14 at 18:59
  • @Laurence: Of course - it looks like the OP is unaware of this capability and is thus overusing the CLR and regex. The idea would be to try to use the full-text indexing to reduce the pattern matching required. – Pieter Geerkens Aug 24 '14 at 19:35
  • There is nothing wrong with adding additional clauses that help SQL Server eliminate rows easier. This isn't like OO where DRY is the devil. – Aaron Bertrand Aug 24 '14 at 19:42
  • @PieterGeerkens As far as I know, full-text is useful when searching fields that have a lot of characters in them. I am dealing with single words, so I used a unique index – Com Piler Aug 25 '14 at 10:43
  • Is your index being used? – Pieter Geerkens Aug 25 '14 at 10:55
  • While using regex, obviously no. But sometimes it is used. [What makes a SQL statement sargable?](http://stackoverflow.com/questions/799584/what-makes-a-sql-statement-sargable) – Com Piler Aug 25 '14 at 11:38
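
Picking up Laurence's suggestion about tries: below is a minimal sketch of one (the `Trie`, `TrieNode`, and `WordsFrom` names are hypothetical, not from any library) that enumerates every stored word spellable from a rack of tiles. Note one difference from the regex `^[def]{1,3}$`: the regex lets a letter repeat any number of times, while the trie walk below consumes each tile at most once, which is closer to actual Scrabble rules.

using System.Collections.Generic;

// A hypothetical in-memory trie: load the dictionary once, then answer
// rack queries without touching the database.
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Insert(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // Every stored word spellable from the rack, each tile used at most once.
    public List<string> WordsFrom(string rack)
    {
        var counts = new Dictionary<char, int>();
        foreach (char c in rack)
        {
            int n;
            counts.TryGetValue(c, out n);
            counts[c] = n + 1;
        }
        var results = new List<string>();
        Walk(root, "", counts, results);
        return results;
    }

    private void Walk(TrieNode node, string prefix, Dictionary<char, int> counts, List<string> results)
    {
        if (node.IsWord)
            results.Add(prefix);
        foreach (var kv in node.Children)
        {
            int left;
            if (counts.TryGetValue(kv.Key, out left) && left > 0)
            {
                counts[kv.Key] = left - 1;                        // consume one tile
                Walk(kv.Value, prefix + kv.Key, counts, results); // descend
                counts[kv.Key] = left;                            // backtrack
            }
        }
    }
}

Call `trie.Insert(word)` once per row of `words`; after that, `trie.WordsFrom("def")` answers a rack query entirely in memory.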

2 Answers


Using CLR and regex in SQL is slow, and there is not much you can do about it. My advice is to limit the amount of regex you need to run, and to reduce the amount of data you have to run the regex on - for example, the way you did in your second query.

The "show query plan" window is your friend for optimizing SQL queries.

The built-in SQL LIKE operator is going to be much quicker, but of course it is also much more limited. The built-in LIKE operator also has the added bonus of sometimes being able to use indexes.

You could possibly rewrite your query using SQL LIKE into something like this:

SELECT *
FROM words
WHERE
 word LIKE '[def][def][def]'
 OR word LIKE '[def][def]'
 OR word LIKE '[def]'

but it is not pretty.

You may also find this one interesting: [What makes a SQL statement sargable?](http://stackoverflow.com/questions/799584/what-makes-a-sql-statement-sargable)

T J
  • Unfortunately in my case it is almost impossible not to use regex, because my project requires making much more complicated queries than the example I have given. The only other solution I see is to build long queries with charindex(), len(), LIKE and others, but that does not seem efficient. But, as you said, I will try to use those functions to limit executions of the regex. Thank you for a good answer – Com Piler Aug 25 '14 at 10:22
  • 1
    i'm running into the same problem myself and i also had to use regex in many places because of their flexibility. to work around the performance issues i had to try to avoid using them as much as possible and in some cases do the regex lookup in advance and storing the result in a lookup table for quick access. you may also want to mark your clr function as deterministic and precise. deterministic means that for a given input the result is always the same, this will allow the query optimizer to cache the result in some cases and it also allows you to use indexed views with the function. – T J Aug 25 '14 at 20:35
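
A minimal sketch of that last tip, assuming the function is deployed as a SQL CLR scalar function (the `UserDefinedFunctions` class name is an arbitrary choice here); it also switches to `Regex.IsMatch`, as Laurence suggested in the question's comments:

using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public class UserDefinedFunctions
{
    // IsDeterministic/IsPrecise promise SQL Server that the same input
    // always yields the same result, which enables optimizations such as
    // indexed views.
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static bool Like(string text, string pattern)
    {
        return Regex.IsMatch(text, pattern);
    }
}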

Based on @Laurence's comment and @user3973227's answer I made the best solution so far. I tried putting my dictionary into memory, and then filtering it with a regex like this:

using System.Data.SqlClient;
using System.Text.RegularExpressions;

// "con" is an already-open SqlConnection to the dictionary database.
var command = new SqlCommand("SELECT word FROM words", con);
SqlDataReader reader = command.ExecuteReader();
Regex regex = new Regex("^[def]{1,3}$");
while (reader.Read())
{
    string word = reader.GetString(0); // read each column value only once
    if (regex.IsMatch(word))
        GameOperations.Log(word);
}

It did the job in... 2.6 seconds. When I added the length condition to my query:

var command = new SqlCommand("SELECT word FROM words WHERE LEN(word) <= 3", con);

I got the results in 0.09 seconds!

It proves that this way of filtering the table with a regex is better than any CLR function. Adding more conditions to the query also speeds up the process, especially when they are sargable (capable of using indexes). Using the Regex constructor is important too - in my case it cut the filtering time in half.
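
One more option that may shave the time further when the same pattern is reused across millions of calls (not benchmarked here): passing `RegexOptions.Compiled` to the `Regex` constructor, which trades slower one-time construction for faster repeated matching.

using System.Text.RegularExpressions;

// Compiles the pattern to IL once; each subsequent IsMatch call is cheaper.
Regex regex = new Regex("^[def]{1,3}$", RegexOptions.Compiled);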

Com Piler