Hash Table and Substring Matching

Question

I have hundreds of keys for example like:

redapple
maninred
foraman
blueapple

i have data related to these keys, data is a string and has related key at the end.

redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
foraman: they-bought-the-present-foraman
blueapple: it-was-surprising-but-it-was-a-blueapple

i am expected to use hash table and hash function to record the data according to keys and i am expected to be able to retieve data from table.

i know to use hash function and hash table, there is no problem here.

But;

i am expected to give the program a string which takes place as a substring and retrieve the data for the matching keys.

For example:

i must give "red" and must be able to get

redapple: the-tree-has-redapple
maninred: she-saw-the-maninred

as output.

or

i must give "apple" and must be able to get

redapple: the-tree-has-redapple
blueapple: it-was-surprising-but-it-was-a-blueapple

as output.

i only can think to search all keys if they has a matching substring, is there some other solution? If i search all the key strings for every query, use of hashing is unneeded, meaningless, is it?

But, searching all keys for substring is O(N), i am expected to solve the problem with O(1).

With hashing i can hash a key e.g. "redapple" to e.g. 943, and "maninred" to e.g. 332.

And query man give the string "red" how can i found out from 943 and 332 that the keys has "red" substring? It is out of my cs thinking skills.

Thanks for any advise, idea.

Why are you "expected to use hashtable"? suffix tree will fit much better. — amit, May 10 '12 at 08:17
Also, you cannot do it in `O(1)` I believe, since for the string `""` (empty string) you will have to output the entire collection. Also, when talking about strings, reading a string is usually regarded as `O(|S|)` and not `O(1)` — amit, May 10 '12 at 08:22
@amit Will suffix trees help if the substring is in the middle of the key? — Adam Matan, May 10 '12 at 08:54
thanks for responses, it is not the first problem to consider the operation time inside string. @deathApril yes this is a homework but i do not want the solution. i want to know if the q was asked right. **i only wonder should i search all strings for given substring(query parameter/search input), or predict like an oracle if the hash codes tell me their key string has the substring** — merveotesi, May 10 '12 at 09:12
@AdamMatan: Sure it will - if there is a string `s` that `t` is a substring of `s`, then `t` will be a prefix - of some suffix of `s`. You can discover it easily by traversing the tree. — amit, May 10 '12 at 10:14
@amit , tahnks for your detailed comments, without suffix tree, using hash table is it possible to not to try all the keys first if the input substring takes place in them? — merveotesi, May 10 '12 at 11:00
You can always put all the substrings in a hashtable, pointing to a list of embracing strings containing them. — Adam Matan, May 10 '12 at 11:42

score 3 · Accepted Answer · answered May 10 '12 at 10:56

3

Possible you should use the invert index for n-gramm, the same approach is used for spell correction. For word redapple you will have following set of 3-gramms red, eda, dap, app, ppl, ple. For each n-gramm you will have a list of string in which contains it. For example for red it will be

red -> maninred, redapple

words in this list must be ordered. When you want to find the all string that contains a a give substring, you dived the substring on n-gramm and intercept the list of words for n-gramm.

This alogriphm is not O(n), but it practice it has enough speed.

answered May 10 '12 at 10:56

Alexander Kuznetsov

3,062
2
25
29

thanks very much for detailed answer, i understood you detail the first choice in my question, i must iterate over all the list of words for a given input. for a search input i must look to every word's n-grams. did i understand right? – merveotesi May 10 '12 at 11:09
1

Note that the list of all n-grams is `O(n^2)` space, and it will make each `insert()` and `remove()` op slower by a factor of at least `O(logn)`, since the relevant location in the index should be found first. – amit May 10 '12 at 11:21
hey somebody to answer me please, **there is no chance to not to look for the words in the key list which has the input substring, am i right?** I only wonder if there is a magical way to predict the hash codes of keys only looking at the input-search-substring? – merveotesi May 10 '12 at 11:37
**or is there a way to write a magic hash function that when i give the input-search-substring to it, it says me the locations of the key's which has the substring in them?** – merveotesi May 10 '12 at 11:53
tuxi first question Yes first you should build index of the words. When you search possibly you can the smallest set and iterate only through it or using same special algorithm for list intersections. – Alexander Kuznetsov May 10 '12 at 14:29
amit you can use only 2-gramm for words length is less when 3. And 3-gramm for other words. – Alexander Kuznetsov May 10 '12 at 14:30
Lookup is O(m) with m being the number of matches, because it is a lookup into a hash table. Building the index is O(n) or O(n²), because it is building a hash table of size proportional to n for each set of n-grams, where n is the number of characters in the strings in the input and there are n different sets of n-grams. – Wolfgang Brehm Feb 10 '20 at 00:06

score 3 · Answer 2 · answered May 10 '12 at 12:41

It cannot be nicely done in a hash table. Given a a substring - you cannot predict the hashed result of the entire string¹

A reasonable alternative is using a suffix tree. Each terminal in the suffix tree will hold list of references of the complete strings, this suffix is related to.

Given a substring t, if it is indeed a substring of some s in your collection, then there is a suffix x of s - such that t is a prefix of x. By traversing the suffix tree while reading t, and find all the terminals reachable from the the node you reached from there. These terminals contain all the needed strings.

(1) assuming reasonable hash function, if hashCode() == 0 for each element, you can obviously predict the hash value.

A suffix array is the better solution. You can do it with a hash table too however by making an index of all the substrings up to a certain length where they become virtually unique, similar to what was suggested in the accepted answer. — Wolfgang Brehm, Feb 10 '20 at 00:03

score -1 · Answer 3 · answered Feb 07 '20 at 01:47

-1

I have researched this problem recently and i'm sure that this can not be done. I hope hash table will help me improve speed of searching like you but it makes me disapointed.

answered Feb 07 '20 at 01:47

May'Habit

1,334
12
18

This can be done, it is the accepted answer. You make an index of all the substrings or possibly just the n-grams you are interested in. Then you lookup into this index. But that is only if you insist on using a hash table, a suffix array would be even more suitable. – Wolfgang Brehm Feb 10 '20 at 00:01
That would be too complex. With a table has about 1 million record, I'll have many million subtrings and this is bad. – May'Habit Feb 10 '20 at 15:03
1

Many million substrings is just a few megabytes, you realise that? – Wolfgang Brehm Feb 10 '20 at 16:50
1

my mistake, I'll research this problem – May'Habit Feb 11 '20 at 09:57

Hash Table and Substring Matching

3 Answers3

Linked