
I am working on a project where I have millions of records in a SQL table. The table schema could be, for example:

- Site Id varchar, Primary Key
- Site Name varchar
- Site Url varchar
- Owner varchar
- Details varchar
- ...

Now the scenario is to provide search on Site Name and Site Url in the UX. For example, say these are the sites displayed in the UX:

| Site Name    | Site Url                     | Site Owner     |
|--------------|------------------------------|----------------|
| NishantBlog1 | https://baseurl.com/nisblog1 | Nishant Saxena |
| NishantBlog2 | https://baseurl.com/nisblog2 | Nishant Saxena |
| RamBlog3     | https://baseurl.com/nisblog3 | Nishant Saxena |

Now say somebody searches for "is"; we then need to return all three of the above sites, as every URL contains "is".

We can of course support this using LIKE '%is%' in this example, but it won't be very efficient: the leading wildcard forces an index scan instead of an index seek, for obvious reasons, even if I create an index on the URL. Is there a way to improve the performance?
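For reference, this is roughly the shape of the table, index, and query today (table, column, and index names are illustrative, not the real schema):

```sql
-- Illustrative schema; the real table has more columns.
CREATE TABLE dbo.Sites
(
    SiteId   varchar(50)  NOT NULL PRIMARY KEY,
    SiteName varchar(200) NOT NULL,
    SiteUrl  varchar(400) NOT NULL,
    Owner    varchar(200) NULL,
    Details  varchar(max) NULL
);

-- Narrow index so the scan at least touches fewer pages than the base table.
CREATE NONCLUSTERED INDEX IX_Sites_SiteName_SiteUrl
    ON dbo.Sites (SiteName, SiteUrl);

-- Leading wildcard => not SARGable, so this still scans the whole index.
DECLARE @term varchar(100) = 'is';
SELECT SiteId, SiteName, SiteUrl
FROM dbo.Sites
WHERE SiteName LIKE '%' + @term + '%'
   OR SiteUrl  LIKE '%' + @term + '%';
```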

I searched around and found SQL Full-Text indexes, but they won't work in our scenario: they could handle a "nisblog3"-type search, but not "is", as full-text matches whole words or prefixes and there is no word separator here.
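For example, assuming a full-text index exists on SiteUrl and the word breaker splits the URL on punctuation, word and prefix searches work but an infix search does not:

```sql
-- Works: whole-word and prefix matches against the indexed tokens.
SELECT SiteId, SiteUrl FROM dbo.Sites WHERE CONTAINS(SiteUrl, '"nisblog3"');
SELECT SiteId, SiteUrl FROM dbo.Sites WHERE CONTAINS(SiteUrl, '"nisblog*"');

-- Does not help: CONTAINS has no infix/substring search, so "is" inside
-- "nisblog1" cannot be matched; a leading wildcard term is not supported.
-- SELECT SiteId, SiteUrl FROM dbo.Sites WHERE CONTAINS(SiteUrl, '"*is*"');
```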

Please note that I have also looked at a SQL materialized path, but I can't go with it even if it helps with performance (I did not test it), as it would increase the size of the table exponentially. Please correct me if I am wrong.

Because of the large size and other issues I can't keep an in-memory trie, and note that Elasticsearch is not an option for us.

Let me know what is the best thing I can do here. The web service is written in C# and I can use stored procedures too.

  • Having a leading wildcard isn't SARGable. Indexes on the column itself *may* help as then it'll have the smallest index to scan (and it will scan *every* value), but otherwise there's not much more you can do. – Thom A Jun 09 '20 at 08:14
  • If your requirement is that you need to search across two columns with a leading wildcard on a table with millions of rows then your query is very likely to return millions, or at least hundreds of thousands, of rows in many cases ('tt', 'ps' for starters...). This cannot be useful to any user and will almost inevitably lead to terrible performance on a relational database. You need another means of filtering the results in conjunction with or instead of this. – strickt01 Jun 09 '20 at 08:16
  • Yes, "tt", "ss", or in my example "is", are not useful searches, but what if somebody just searches for "blog"? Now "blog" could be anywhere in the URL, and it can be useful for a user to search with that keyword. – Nishant Saxena Jun 09 '20 at 09:54
  • If you wanted to return all blogs then you would be better off holding the type of site in a separate indexed column and filtering using that (if there are enough distinct types of site to make the index worthwhile). Even then however you are still looking at a lot of rows being returned. As it stands you are going to have to perform a table or index scan across a large table every time you want to run the query. And the amount of records that would be returned each time would seem to make the requirement questionable from both a database and UX perspective. – strickt01 Jun 09 '20 at 13:25
  • @strickt01 Yes, that's the problem we have right now, and I am looking for answers on how to make it efficient. Could pagination be useful here? For example, return just the top 100, then as the user scrolls fetch 101 to 200, and so on. "Blog" was just an example; it could be anything the user thinks is unique enough to search for, which we don't know in advance. – Nishant Saxena Jun 09 '20 at 15:01
  • You need to find a means of filtering the data that is not just a scan of two columns if you want efficiency. If users are searching for websites then can the data be segmented in that way? i.e. filtering on their ID or on some form of group ID. Anything to which you can apply a selective index which can be used prior to the `LIKE` predicates. – strickt01 Jun 09 '20 at 15:09
  • RE. pagination - would help in certain circumstances but if you search for a value that does not return the amount of rows in a page then you end up scanning the full table. The same is true as you increase the page number - the read count gets higher and higher. You need to be using an index that dramatically reduces the potential amount of rows that need to be read when you apply your `LIKE`. – strickt01 Jun 09 '20 at 15:37
  • @strickt01 Can parallelization help here? I think so. Say I have multiple tasks that fire SELECT queries in parallel using OFFSET and FETCH? – Nishant Saxena Jun 10 '20 at 11:49
  • Then you are just running multiple slow, resource-intensive queries in parallel? If you really want to support the current requirement you need an index. You could create a separate table with a clustered index containing all the substring variations for each row in your current table. So when a record is inserted/updated you call a function that returns every possible substring in the columns you need to search. Then associate these to the record in your site table. Your query would then do a clustered index seek on this table for the input string and join to the site table for the results. – strickt01 Jun 10 '20 at 13:51
  • For now I am using another table in which I store the substrings (of length at least 4) of the URL, mapped back to the actual URL; a rough sketch of that approach is below. It's working for now; let's see how it goes. Storage is not a problem for now. Thanks a lot @strickt01 for all the suggestions. – Nishant Saxena Jul 06 '20 at 13:33
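A minimal sketch of the substring-lookup approach described in the last two comments. The table, procedure, and column names (dbo.SiteUrlSubstrings, AddSiteUrlSubstrings, Fragment) are hypothetical, and the minimum fragment length of 4 follows the comment above; this is an illustration, not the exact production code:

```sql
-- Hypothetical lookup table: one row per distinct substring (length >= 4)
-- of each SiteUrl, clustered on the fragment so a search term becomes a seek.
CREATE TABLE dbo.SiteUrlSubstrings
(
    Fragment varchar(400) NOT NULL,
    SiteId   varchar(50)  NOT NULL REFERENCES dbo.Sites (SiteId),
    CONSTRAINT PK_SiteUrlSubstrings PRIMARY KEY CLUSTERED (Fragment, SiteId)
);
GO

-- Regenerate the fragments for one site; call this whenever the URL is
-- inserted or updated (from the application code path or a trigger).
CREATE OR ALTER PROCEDURE dbo.AddSiteUrlSubstrings
    @SiteId varchar(50),
    @Url    varchar(400)
AS
BEGIN
    SET NOCOUNT ON;

    DELETE FROM dbo.SiteUrlSubstrings WHERE SiteId = @SiteId;

    IF LEN(@Url) < 4 RETURN;

    -- sys.all_objects is used only as a row source for generating numbers.
    ;WITH Starts AS (SELECT TOP (LEN(@Url) - 3)
                            ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS s
                     FROM sys.all_objects),
          Lens  AS (SELECT TOP (LEN(@Url) - 3)
                            ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) + 3 AS l
                     FROM sys.all_objects)
    INSERT INTO dbo.SiteUrlSubstrings (Fragment, SiteId)
    SELECT DISTINCT SUBSTRING(@Url, s.s, l.l), @SiteId
    FROM Starts AS s
    JOIN Lens  AS l ON s.s + l.l - 1 <= LEN(@Url);
END;
GO

-- Search: clustered index seek on the fragment, then join back to the sites.
DECLARE @term varchar(400) = 'blog';   -- caller enforces the minimum length of 4
SELECT s.SiteId, s.SiteName, s.SiteUrl
FROM dbo.SiteUrlSubstrings AS f
JOIN dbo.Sites AS s ON s.SiteId = f.SiteId
WHERE f.Fragment = @term;
```

The trade-off is the one already noted in the comments: the lookup table grows much faster than the base table (roughly one row per character position per extra length), so it buys seek performance at the cost of storage and extra write work on every insert/update.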
