3

I'have a question for you. I have to implement a business address book that contains 30000 names. All the names contains firstname and lastname. I have to implement an autocomplete textbox that search not only typing firstname but also by lastname. Searching on google I've seen that this problem is solved using a patricia trie but it does only prefix search so if I create a trie with firstname+lastname how can I search not only by firstname but also by lastname?

Do I have to duplicate entries inserting two strings like this? Firstname+lastname And Lastname+firstname

Please help me!!!

The search have to be very efficient.

Thanks.

Mapo
  • 115
  • 1
  • 11

2 Answers2

2

Another possibility is creating two tries.

The first one (let it be T1) is for first names, and the second (let it be T2) for last names.

When you construct the trie, from each word terminator in T1 (usually denoted as $ sign), add a list of pointers to the relevant entries in T2, and vise versa.

I.E. if John Doe is an entree:

T1:
     J
     |
     O
     |
     H
     |
     N
     |
     $1
T2:
     D
     |
     O
     |
     E
     |
     $2

$1 will hold a list, containing pointer to $2, and $2 will hold a list, containing $1.

each prefix search will search on both tries, getting you the auto completion, and then use the pointers to get the full name (the partial search gave you only first/last name, you get the second using the pointers).

Searching for full name is done by searching in both tries (look for the first name in T1 and for the last name in T2, and get the relevant $1 and $2 respectively), then you need to check if the pointers match (the list l1 in $1 contains $2 and the list l2 in $2 contains $1). If they do - the name is in the dictionary.

Note that once you have a pointer to the $ node, one can simply go back on the trie, until you get to the root to get the word this $ sign is representing. (requires pointer to parent from each node)

Also note: I explained about simple tries, but there is really no reason why not to use patricia tries instead, using the same approach.

amit
  • 175,853
  • 27
  • 231
  • 333
  • Ok, thanks for your answer. I have to study it. One question. Searching on two different tries is efficient? How about performance? Consider that this structure have to implemente on server side! Thanks – Mapo Jun 06 '12 at 21:38
  • @user788779: Searching in the two tries is not less efficient then searching a single one in this case, it might even be better because it can be parallelized - which might be helpful for huge strings (though it is seldom the case). The only slow-down in this approach is matching the list of pointers once you found `$1` and `$2`. – amit Jun 06 '12 at 21:40
  • Ok. I've read a possible solution could be to use a permuterm index for wildcard search. According to you this solution could help me? – Mapo Jun 06 '12 at 21:44
  • 1
    @user788779: I think it could help you, but note it will require double the space as this approach. I think time consumption is similar to both approaches. – amit Jun 06 '12 at 21:48
  • I would not be too concerned about finding the perfect solution -- you should be fine as long as it is efficient. – Stefan Haustein Jun 06 '12 at 21:53
  • @all: I have an idea. Could you say me if it's valid? I create a trie cointaining firstname+lastname;user_id and lastname+firstname;user_id On autocomplete to show only one record per person I can control if that user_id exists and do not show it. Could it work? – Mapo Jun 06 '12 at 21:57
  • 1
    @user788779: The problem with this approach: If you type `"Joh"`, the auto complete will find it might refer to `"John"`, but it will not be able to connect it to `"Doe"`, since you cannot search by id in the suggested data structure. – amit Jun 06 '12 at 22:01
  • Ok, but if I search for Doe I'll find doe john! – Mapo Jun 06 '12 at 22:11
  • But if there's an enty with names marco marchi. If I search for marc I'll find two results marco marchi and marchi marco, but Client side I'll shown only one of them beacuse I have the Id. Is it right? Is it possible to use this approach? Is it efficient? – Mapo Jun 06 '12 at 22:14
0

Yes, the simplest solution is to insert both variants. However, this should only duplicate the search string, not the entry. You probably want to normalize the separation between first and last name somehow (=remove punctuation chars for the address book and for the user input), so you'll find the entries in all cases for input such as "John Doe", "Doe, John", "Doe John" etc.

I would not use a partial trie but just a balanced tree. In many languages, you'll find balanced trees as the sorted map implementation in the library (at least Java and C++).

Stefan Haustein
  • 18,427
  • 3
  • 36
  • 51
  • Thanks for you answer!! But when I search a string it's possibile to obtain two records tha represent the same person! For example marco marchi. So if I search for marc I obtain two records: marco marchi and marchi marco. So what to do? – Mapo Jun 06 '12 at 21:23
  • How can a balanced tree give him partial match? Also note that balanced trees are less efficient - assymptotically speaking for searching existence of strings. – amit Jun 06 '12 at 21:29
  • You could also add parts of the Address or the birth date to the key, ideally something that helps the user to select the right entry. To make sure you have a unique key and you don't need a list as value, also append a unique record id. You may want to hide the id from the user. – Stefan Haustein Jun 06 '12 at 21:30
  • @amit: To get a partial match in a tree, start iterating over the keys at the prefix until you find an entry that does not match the prefix. Tries may have a better performance in theory, but there is a reason why most standard libraries don't have them: Memory consumption is horrible, and the data structure will be scattered over a huge memory area, causing cpu cache issues. – Stefan Haustein Jun 06 '12 at 21:36
  • Balanced tree vs patricia trie? Which is the best for my problem? Thanks – Mapo Jun 06 '12 at 21:39
  • @user788779: It is extremely hard to tell, you will probably have to benchmark both on the specific system to get a clear conclusion. – amit Jun 06 '12 at 21:43
  • Unless you value the implementation time :) – Stefan Haustein Jun 06 '12 at 21:44