46

I'm trying to implement something like Google Suggest on a website I am building and am curious how to go about doing it on a very large dataset. Sure, if you've got 1,000 items you cache them and just loop through them. But how do you go about it when you have a million items? Further, suppose that the items are not one word.

Specifically, I have been really impressed by Pandora.com. For example, if you search for "wet" it brings back "Wet Sand", but it also brings back "Toad The Wet Sprocket". And their autocomplete is FAST. My first idea was to group the items by the first two letters, so you would have something like:

Dictionary<string,List<string>>

where the key is the first two letters. That's OK, but what if I want to do something similar to Pandora and allow the user to see results that match the middle of the string? With my idea, "wet" would never match "Toad the Wet Sprocket" because it would be in the "TO" bucket instead of the "WE" bucket. So then perhaps you could split the string up, and "Toad the Wet Sprocket" would go in the "TO", "WE" and "SP" buckets (stripping out the word "the"). But when you're talking about a million entries, each possibly a few words long, it seems like you'd quickly start using up a lot of memory. OK, that was a long question. Thoughts?
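For concreteness, here is a rough sketch of the bucketing idea I describe above (the stop-word list and member names are illustrative, not a real implementation):

```csharp
// Sketch: index each phrase under the first two letters of every word,
// so "Toad the Wet Sprocket" lands in the "TO", "WE" and "SP" buckets.
using System;
using System.Collections.Generic;

class TwoLetterIndex
{
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the" };

    private readonly Dictionary<string, List<string>> _buckets =
        new Dictionary<string, List<string>>();

    public void Add(string phrase)
    {
        foreach (string word in phrase.Split(' '))
        {
            if (word.Length < 2 || StopWords.Contains(word))
                continue;
            string key = word.Substring(0, 2).ToUpperInvariant();
            List<string> bucket;
            if (!_buckets.TryGetValue(key, out bucket))
                _buckets[key] = bucket = new List<string>();
            bucket.Add(phrase);
        }
    }

    public IEnumerable<string> Suggest(string query)
    {
        if (query.Length < 2)
            return new string[0];
        List<string> bucket;
        if (!_buckets.TryGetValue(query.Substring(0, 2).ToUpperInvariant(), out bucket))
            return new string[0];
        // The bucket only narrows the candidates by the first two letters
        // of some word; a substring check still has to filter it.
        return bucket.FindAll(p =>
            p.IndexOf(query, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```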

aquinas
  • Can you give some numbers about your data? How many strings? Average length and average number of words? Which language? – Daniel Brückner Mar 25 '09 at 14:37
  • 496,000 strings currently. Average length 14 characters. Average number of words 2.3. C#. Oh, and there are non-ASCII characters like Л and И. – aquinas Mar 26 '09 at 00:06
  • I've implemented an autocomplete solution using Docker Compose (so that it is easier to follow), and the source code is openly available: https://lopespm.github.io/2020/08/03/implementation-autocomplete-system-design.html. Hope that helps. – Pedro Lopes Aug 04 '20 at 10:40

7 Answers

34

As I pointed out in How to implement incremental search on a list you should use structures like a Trie or Patricia trie for searching patterns in large texts.

And for discovering patterns in the middle of some text there is one simple solution. I am not sure if it is the most efficient solution, but I usually do it as follows.

When I insert some new text into the trie, I just insert it, then remove the first character and insert again, remove the first two characters and insert again ... and so on until the whole text is consumed. Then you can discover every substring of every inserted text with just one search from the root. The resulting structure is called a suffix trie (its compacted form is the well-known suffix tree), and there are a lot of optimizations available.

And it is incredibly fast. To find all texts that contain a given sequence of n characters, you have to inspect at most n nodes and perform a search on the list of children for every node. Depending on the implementation (array, list, binary tree, skip list) of the child node collection, you might be able to identify the required child node with as few as 5 search steps, assuming case-insensitive Latin letters only. Interpolation search might be helpful for large alphabets and nodes with many children, such as those usually found near the root.
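Here is a minimal sketch of this suffix-insertion trie in C# (the class and member names are mine, purely illustrative; a real implementation would compact single-child chains to save memory):

```csharp
// Sketch: insert every suffix of each phrase so that a single walk
// from the root finds substring matches. Storing a phrase reference
// at every node trades memory for lookup speed.
using System;
using System.Collections.Generic;

class SuffixTrieNode
{
    public Dictionary<char, SuffixTrieNode> Children =
        new Dictionary<char, SuffixTrieNode>();
    public HashSet<string> Phrases = new HashSet<string>();
}

class SuffixTrie
{
    private readonly SuffixTrieNode _root = new SuffixTrieNode();

    public void Add(string phrase)
    {
        string text = phrase.ToLowerInvariant();
        // Insert "wet sand", then "et sand", then "t sand", and so on.
        for (int start = 0; start < text.Length; start++)
        {
            SuffixTrieNode node = _root;
            for (int i = start; i < text.Length; i++)
            {
                SuffixTrieNode child;
                if (!node.Children.TryGetValue(text[i], out child))
                    node.Children[text[i]] = child = new SuffixTrieNode();
                node = child;
                node.Phrases.Add(phrase);
            }
        }
    }

    // Inspects at most pattern.Length nodes, as described above.
    public IEnumerable<string> Search(string pattern)
    {
        SuffixTrieNode node = _root;
        foreach (char c in pattern.ToLowerInvariant())
            if (!node.Children.TryGetValue(c, out node))
                return new string[0];
        return node.Phrases;
    }
}
```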

Daniel Brückner
  • Trie works great for finding matches at the beginning of a string. However, with my current dataset the process of removing the first char and then inserting didn't end up working very well; it just started using way too much memory: > 1 GB before it was half done with the dataset. – aquinas Mar 25 '09 at 13:17
  • This may be a case of premature optimization: when I just ran a naive "contains" search, the runtime was less than 100 milliseconds. Lucene also looks really cool, so I might try that for fun. Another idea would be to use a combination of trie and naive search. – aquinas Mar 25 '09 at 13:20
  • Start with the trie, and if you have fewer than 20 starts-with matches, fall back to naive search. Why can I only insert 300 characters??! – aquinas Mar 25 '09 at 13:20
  • A standard trie is very memory-intensive; for larger sets you want to use a compacted trie, which greatly reduces the memory footprint. Additional optimisations include lazy initialisation of node values and the right data structures for the children/value sets. A while ago I created a [Java autocomplete library](https://github.com/fmmfonseca/completely) capable of handling very large data sets (10,000,000+) and efficiently answering exact and approximate searches. – Filipe Miguel Fonseca Jun 24 '15 at 08:02
  • I like @Daniel's answer. – gramcha Jan 19 '17 at 08:28
11

Don't try to implement this yourself (unless you're just curious). Use something like Lucene or Endeca - it will save you time and hair.
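For instance, a minimal sketch using the Lucene.NET port (class names assume the Lucene.NET 4.8 API, and the two-title index is purely illustrative):

```csharp
// Sketch: index a couple of titles in memory and search them with Lucene.NET.
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

class LuceneSuggest
{
    const LuceneVersion Version = LuceneVersion.LUCENE_48;

    public static void Main()
    {
        var dir = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version);
        using (var writer = new IndexWriter(dir, new IndexWriterConfig(Version, analyzer)))
        {
            foreach (var title in new[] { "Wet Sand", "Toad the Wet Sprocket" })
            {
                var doc = new Document();
                doc.Add(new TextField("title", title, Field.Store.YES));
                writer.AddDocument(doc);
            }
        }

        // "wet" matches both titles because the analyzer indexed each word.
        using (var reader = DirectoryReader.Open(dir))
        {
            var searcher = new IndexSearcher(reader);
            var hits = searcher.Search(new TermQuery(new Term("title", "wet")), 10);
            foreach (var hit in hits.ScoreDocs)
                System.Console.WriteLine(searcher.Doc(hit.Doc).Get("title"));
        }
    }
}
```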

Jim Arnold
7

Not algorithmically related to what you are asking, but make sure you wait 200ms or more after the keypress(es), so that the user has stopped typing before you issue the asynchronous request. That way you will reduce redundant HTTP requests to the server.

cherouvim
2

I would use something along the lines of a trie, and have the value of each leaf node be a list of the possibilities that contain the word represented by the leaf node. You could sort them in order of likelihood, or dynamically sort/filter them based on other words the user has entered into the search box, etc. It will execute very quickly and in a reasonable amount of RAM.

rmeador
1

If you don't want a trie and you want matches from the middle of the string, you generally want to run some sort of edit distance function (Levenshtein distance), which gives you a number indicating how well two strings match up. It's not a particularly efficient algorithm, but that doesn't matter too much for things like words, as they're relatively short. If you're running comparisons on 8,000-character strings, it'll probably take a few seconds. Most languages have an implementation, or you can find code/pseudocode for it pretty easily on the internet.
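For reference, here is a minimal sketch of the standard dynamic-programming formulation in C# (illustrative, not tuned):

```csharp
// Sketch: classic Levenshtein distance via a dynamic-programming table.
using System;

static class EditDistance
{
    public static int Levenshtein(string a, string b)
    {
        // d[i, j] = edits needed to turn a[..i] into b[..j].
        var d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // deletions
        for (int j = 0; j <= b.Length; j++) d[0, j] = j; // insertions

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,         // delete from a
                    d[i, j - 1] + 1),        // insert into a
                    d[i - 1, j - 1] + cost); // substitute (or match)
            }
        }
        return d[a.Length, b.Length];
    }
}
// Example: EditDistance.Levenshtein("wet", "wit") == 1
```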

Ian Ooi
1

You keep the items on the server side (perhaps in a DB, if the dataset is really large and complex) and you send AJAX calls from the client's browser that return the results as JSON/XML. You can do this in response to the user typing, or on a timer.
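A minimal sketch of the server side, assuming an ASP.NET Core minimal API (the route, parameter name, and in-memory list are all illustrative):

```csharp
// Sketch: a hypothetical /suggest endpoint that returns JSON suggestions.
using System;
using System.Linq;
using Microsoft.AspNetCore.Builder;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// In a real application this would be a database or a search index.
string[] titles = { "Wet Sand", "Toad the Wet Sprocket" };

// The client calls e.g. GET /suggest?q=wet and renders the JSON array.
app.MapGet("/suggest", (string q) =>
    titles.Where(t => t.Contains(q, StringComparison.OrdinalIgnoreCase))
          .Take(10));

app.Run();
```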

Assaf Lavie
-1

I've built AutoCompleteAPI for exactly this scenario.

Sign up to get a private index, then upload your documents.

Example upload using curl for the document "New York":

curl -X PUT -H "Content-Type: application/json" -H "Authorization: [YourSecretKey]" -d '{
"key": "New York",
"input": "New York"
}' "http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]"

After indexing all documents, use the following to get autocomplete suggestions:

http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]?prefix=new

You can use any client autocomplete library to show these results to the user.

Sean