
I have the following

public class SearchResult
{
    public string Description { get; set; }
    public int Year { get; set; }
    public int Type { get; set; }
}

I create a List<SearchResult> of these, cache it, and then try to search through this collection (1.2M records) with the following:

var found = myList.Where(x => x.Description.StartsWith(query)).Take(10).ToList();

This is pretty slow for what I am going for. Is there a better way to store a list of objects and have the ability to search the string property of the object?

Should I be sorting the collection before caching it? I would like the ability to do .StartsWith and .Contains on the Description property with the fastest path to the top 10 matches.

If I just go to the db it's faster (I have put an index on the text field). I was hoping to improve performance by getting the results once, sticking them in memory, and then running all searches against the in-memory cache instead of going to the db each time. But this is proving to be slower than the db calls using the SQL LIKE '{query}%' statement.

Zoinky
  • Because LIKE 'query%' is sargable, so it uses the index to search. You could also build some tree from the data to search, but that will be slow if you only do it once. Look at B-trees; that is how the index is stored. – Giorgi Nakeuri Nov 21 '15 at 05:20
  • The `Description` property contains a single word or a phrase? How about query? – Igor Bendrup Nov 21 '15 at 10:27
  • @IgorBendrup they are movie titles; could be a word or a phrase – Zoinky Nov 21 '15 at 17:21

3 Answers


String comparisons are inherently slow, and additionally you have to iterate the entire list every time to see if there is a match. This is never going to perform well, and in fact will more than likely get worse over time as new records are added to the source.

Here is a nice article on string searching for those concerned with speed.

I would suggest doing just as you mentioned: move the searching to the database and limit the number of returned rows. While this still involves I/O, databases are optimized for handling this sort of thing. Another advantage is that you avoid the pitfall of your app crashing and losing those cached searches; likewise you can leverage async/await, which will make your app more responsive.

If you decide that you still want to go the route of pulling everything into memory and then querying the objects, good luck. The only other suggestion I have would be to consider search caching, such that if someone searched for the same thing relatively recently, you could cache those results and return them immediately.

From the same author, here is another source for reading - here he compares collection string lookup speed.

http://cc.davelozinski.com/c-sharp/fastest-collection-for-string-lookups

David Pine
  • It's weird that these links don't even mention Boyer-Moore or Knuth-Morris-Pratt, which are probably much more efficient than all these other methods (presuming you need to do many lookups and can spare a couple of instructions on preprocessing the search string). – vgru Nov 21 '15 at 19:43
  • Do you have any specific examples to share? – David Pine Nov 21 '15 at 19:45
  • Nothing specific, implementations can be found on Google. Just wanted to say the list is not nearly complete (I don't think the author tried using a compiled regex, for example). And the second link only compares built-in ways of searching for an *exact* string within a collection (where using a hashset is a no-brainer), which is not applicable to many problems where you need a partial match. – vgru Nov 21 '15 at 20:06
  • +1, I didn't realize OP needs `Contains` along with `StartsWith`, in which case your answer makes much more sense. – vgru Nov 22 '15 at 09:05

First of all, this is an information retrieval task, and it seems there isn't an efficient way to do it using only LINQ.

What you need is some kind of reverse index. A very simple implementation of a reverse index in your case is a Dictionary<string, List<SearchResult>>. For each word in the movie titles, this dictionary contains all movies whose title contains that word. To build this simple reverse index you can use the following code:

    reverseIndex = new Dictionary<string, List<SearchResult>>();
    foreach (var res in searchResults)
    {
        string[] words = res.Description.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            if (!reverseIndex.TryGetValue(word, out var list))
                reverseIndex[word] = new List<SearchResult> { res };
            else if (!list.Contains(res))
                list.Add(res);
        }
    }

Now, instead of slow:

searchResults.Where(x => x.Description.Contains(query));

you can use simple:

reverseIndex[query];

It works very fast. Instead of

searchResults.Where(x => x.Description.StartsWith(query));

you can use:

reverseIndex[query].Where(s => s.Description.StartsWith(query));

If your query contains more than one word, you can split it into words, extract the List<SearchResult> for each word, and then intersect the lists.
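As a minimal sketch of that multi-word lookup (the `FindAll` helper name is made up here, and it assumes a `reverseIndex` built as shown above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class SearchResult
{
    public string Description { get; set; }
    public int Year { get; set; }
    public int Type { get; set; }
}

public static class ReverseIndexSearch
{
    // Hypothetical helper: intersects the posting lists of every word in the query.
    public static List<SearchResult> FindAll(
        Dictionary<string, List<SearchResult>> reverseIndex, string query)
    {
        var words = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        List<SearchResult> result = null;
        foreach (var word in words)
        {
            if (!reverseIndex.TryGetValue(word, out var postings))
                return new List<SearchResult>();   // a missing word means no match

            result = result == null
                ? new List<SearchResult>(postings)         // first word: copy its list
                : result.Intersect(postings).ToList();     // later words: narrow it down
        }
        return result ?? new List<SearchResult>();
    }
}
```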

With this simple implementation of a reverse index, your query can contain only whole words. If you want to search by part of a word, you need to use a permuterm index. One possible implementation in C# can be found here. Note that a permuterm index needs a lot of additional memory.

Igor Bendrup
  • Why do you think `query` won't be a single letter? To me this looks like server side autocomplete functionality, and dictionary is not the appropriate data structure for this. – vgru Nov 21 '15 at 23:26
  • OP says: "I would like to have the ability to do ... `.Contains` on Description property...". But a trie allows only `.StartsWith` queries, doesn't it? So, I recommended using a permuterm index for wildcard queries. – Igor Bendrup Nov 21 '15 at 23:28
  • Anyway, +1 for mentioning permuterm indexes, which would handle both `StartsWith` and `Contains` cases well, although it requires a fairly large index tree for storing all possible permutations. But permuterm is what answers OP's question, while a plain dictionary approach fails at anything but whole word lookups, so you might want to rearrange your question to emphasize that. – vgru Nov 22 '15 at 08:52

Fast string prefix searches are best done using a trie data structure. What's cool about the trie is that all descendants of any given node have a common prefix of the string associated with that node. It can also be compacted into a radix tree, which is perhaps slightly more complex to implement.

Right now you are using LINQ-to-objects to iterate through the entire list for every search (and each StartsWith call is O(m), with m being the length of the query string). If you used LINQ-to-SQL, it would get translated to an SQL query which would use indexes to perform a more efficient lookup.

This link has an example implementation of the auto-complete functionality using a trie.
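For illustration only (this is not the linked implementation; the `Insert`/`StartsWith` names and the node layout are made up here), a minimal uncompacted trie that answers prefix queries could look like this. Note the memory trade-off: each node stores the list of words passing through it, so top-10 results come straight off the node reached after m character steps.

```csharp
using System;
using System.Collections.Generic;

public class Trie
{
    private class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public List<string> Words = new List<string>();   // all words passing through this node
    }

    private readonly Node _root = new Node();

    public void Insert(string word)
    {
        var node = _root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new Node();
            node = child;
            node.Words.Add(word);   // record the word at every node on its path
        }
    }

    // Walks m nodes for a prefix of length m, then returns up to `limit` matches.
    public List<string> StartsWith(string prefix, int limit = 10)
    {
        var node = _root;
        foreach (var c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                return new List<string>();   // no word has this prefix
        return node.Words.GetRange(0, Math.Min(limit, node.Words.Count));
    }
}
```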

(Update)

As @David mentioned in the comments, a trie might be overkill if you already have this data loaded in a list (that is, if you need to keep it in this form anyway, which you probably do). In that case, a better alternative for StartsWith queries would be to have the list sorted. That would allow you to use binary search to get your results in O(m log n).
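A sketch of that sorted-list approach, assuming the list is kept ordered by `Description` with an ordinal comparison (the `TopMatches` helper name is hypothetical):

```csharp
using System;
using System.Collections.Generic;

// SearchResult as defined in the question.
public class SearchResult
{
    public string Description { get; set; }
    public int Year { get; set; }
    public int Type { get; set; }
}

public static class PrefixSearch
{
    // Finds up to `limit` items whose Description starts with `prefix`,
    // assuming `sorted` is ordered by Description (StringComparer.Ordinal).
    public static List<SearchResult> TopMatches(
        List<SearchResult> sorted, string prefix, int limit = 10)
    {
        // Binary search for the first index whose Description >= prefix: O(m log n).
        int lo = 0, hi = sorted.Count;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (string.CompareOrdinal(sorted[mid].Description, prefix) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }

        // Matches are contiguous from `lo`; scan forward while the prefix holds.
        var result = new List<SearchResult>();
        for (int i = lo; i < sorted.Count && result.Count < limit; i++)
        {
            if (!sorted[i].Description.StartsWith(prefix, StringComparison.Ordinal))
                break;
            result.Add(sorted[i]);
        }
        return result;
    }
}
```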

Depending on whether the data is changed often, you might also use a balanced binary tree for storing the data, to allow you fast insertion/deletion (which is essentially what SortedDictionary gives you).

But ultimately, if you also need Contains queries, then you will either need to spend much more memory on indexing (as described by @Igor in his answer), or simply let your DBMS do it (as suggested by @David).

On the other hand, you might want to try using SQL Server's full text search. Or, if you are willing to go a bit outside of writing SQL, Lucene.Net with in-memory caching should give you even better performance.

vgru
  • Here is [another autocomplete example](http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/) with a couple of nice pictures and the source code. – vgru Nov 21 '15 at 19:48
  • The issue with this answer is that the trie approach is really only going to add overhead, i.e.; more objects in memory. It was stated that there are 1.2+ million records; as such, pulling all of those into memory plus the extra data structures of the trie will only worsen performance. http://upcommons.upc.edu/bitstream/handle/2099.1/16165/78135.pdf?sequence=1 – David Pine Nov 21 '15 at 20:35
  • @David: of course it's going to add memory overhead, that's what you are supposed to use your server's memory for. But worsen performance? How exactly? OP's current algorithm has a runtime complexity of `O(n*m)`, where `n` is the number of records (1.2+ million of them), and `m` is the length of the query string. He needs to iterate through 1.2 million items for each query string. Now compare this to a trie, which needs at most `m` comparisons to get to the correct node. Not only using a trie will not *worsen* performance, it will improve it by **several orders of magnitude**. – vgru Nov 21 '15 at 20:56
  • My point is being misconstrued. I challenge you to create a SQL DB with 1.2 million records where there is a string field that has a varied length. Try performance testing both ways, i.e.; let SQL do the lookup vs. the trie data structure. I know which is more performant but would love to be enlightened. – David Pine Nov 21 '15 at 21:03
  • I am not sure what your point is, perhaps I misunderstood, but please clarify it. RDBMS are definitely smart, and with right indexes can perform incredibly well. They also utilize memory efficiently. But will an in-memory trie return a prefix search faster than a `%like%` query ran against a vanilla SQL Server instance? If implemented reasonably, **yes**. Just like a .NET in-memory dictionary will yield faster lookups than SQL queries by an indexed column. Just like a simple binary search on a sorted `List` will run faster than a sargable SQL query. – vgru Nov 21 '15 at 23:19
  • But, if the data already has to be loaded in memory in its current form, then OP could indeed skip the whole trie part, keep the existing list sorted by name, and use binary search to get results in `O(m log n)`. – vgru Nov 21 '15 at 23:25
  • A trie is very efficient for queries with a trailing wildcard ('query%'). But how can it be applied to queries with leading and trailing wildcards ('%query%')? – Igor Bendrup Nov 21 '15 at 23:33
  • @Igor: you are right, I didn't notice this requirement at all, but this tiny detail makes my answer much less relevant. :) – vgru Nov 22 '15 at 08:52