1

I have an array with a lot of strings and I want to search it for a pattern. The pattern can contain "." wildcards, each of which matches exactly one arbitrary character.

For example:

myset = {"bar", "foo", "cya", "test"}

find(myset, "f.o") -> returns true (matches with "foo") 
find(myset, "foo.") -> returns false 
find(myset, ".e.t") -> returns true (matches with "test")
find(myset, "cya") -> returns true (matches with "cya")

I tried to find a way to implement this search quickly, because myset is actually a very big array, but none of my ideas has satisfactory complexity (for example, O(size_of(myset) * length(pattern))).
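
For reference, the trivial algorithm looks something like this (a minimal C++ sketch; representing myset as a std::vector<std::string> is illustrative):

```cpp
#include <string>
#include <vector>

// Trivial baseline: O(size_of(myset) * length(pattern)).
// '.' in the pattern matches any single character; the whole word must match.
bool find(const std::vector<std::string>& myset, const std::string& pattern) {
    for (const std::string& word : myset) {
        if (word.size() != pattern.size()) continue;  // lengths must agree
        bool ok = true;
        for (size_t i = 0; i < pattern.size(); ++i) {
            if (pattern[i] != '.' && pattern[i] != word[i]) { ok = false; break; }
        }
        if (ok) return true;  // matched at least one word
    }
    return false;
}
```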

Edit:

myset is a huge array, but the words in it aren't long. I can do slow preprocessing, but I'll have a great many find() queries, so I want find() to be as fast as possible.

Murilo Vasconcelos
  • One wildcard per pattern or more? – Dr. belisarius Feb 02 '11 at 00:38
  • Is the set fixed? You could build a trie out of it and match patterns to the trie. – Fred Foo Feb 02 '11 at 00:48
  • What language? Could you use an existing regular expression library? – Justin Morgan Feb 02 '11 at 00:53
  • @belisarius the third example shows two wildcards in the same pattern. – Murilo Vasconcelos Feb 02 '11 at 00:58
  • Thanks. I got confused because you wrote _This pattern can have **a** "."_ – Dr. belisarius Feb 02 '11 at 01:01
  • @larsmans: I thought of that, but for matching "........" where the pattern is as long as the longest strings in the set, the search will be on the order of the size of the set (the set is huge). I also thought about having one trie per length, but that does not help either; the branching factor is high. – Murilo Vasconcelos Feb 02 '11 at 01:01
  • @belisarius: Sorry, I misspelled! Thank you the advice! – Murilo Vasconcelos Feb 02 '11 at 01:02
  • Also, in all the examples the first char in the pattern matches the first char in the string. Is that a rule? – Dr. belisarius Feb 02 '11 at 01:02
  • @Justin: I'm implementing this in C++. I could use regex, but I don't want to; I really need performance. – Murilo Vasconcelos Feb 02 '11 at 01:05
  • @belisarius: Yes, I want to match the entire string. For example find(myset, ".") will return false. – Murilo Vasconcelos Feb 02 '11 at 01:07
  • how long are the strings in your array on average? Based on your example, I don't really see using regex as a problem in performance...I mean maybe the initial memory hit from including the library but as far as getting the job done...I don't really see the problem... – CrayonViolent Feb 02 '11 at 01:12
  • @crayon It is a problem because it is expected to be slower than the trivial algorithm (see the templatetypedef answer below). – Murilo Vasconcelos Feb 02 '11 at 01:16
  • Is `O(size_of(myset) * length(pattern))` what you currently have or what you want? I'm having problems even imagining how you could get better complexity than that, because anything better wouldn't even process every character, unless I'm missing something. – user470379 Feb 02 '11 at 01:42
  • What is the length of the longest word in myset? –  Feb 02 '11 at 01:45
  • @Moron: I don't know in detail yet, but it is quite small (I can give a bound of 15, for example). The size of myset is big. – Murilo Vasconcelos Feb 02 '11 at 01:53
  • @Murilo: Do you need to return all matches or just one? Are there any memory limits? –  Feb 02 '11 at 01:58
  • @Moron: It is a yes/no function (yes if the pattern matches at least one word, no otherwise). There are no memory limits, but we have to be reasonable. – Murilo Vasconcelos Feb 02 '11 at 02:03
  • @Murilo: I will tell you what I was thinking: For each word, you generate patterns that it might match and put the words along with the patterns into a trie (i.e. you consider .e.t as a word). If you had to return just one match, then you could store a word at the last node for a pattern. Depending on how the memory usage goes, you could potentially generate only some patterns, and rely on backtracking in the trie for the rest etc. –  Feb 02 '11 at 02:05
  • @Moron That's a lot of leaves for 15 chars! – Dr. belisarius Feb 02 '11 at 02:08
  • @belisarius: You could try a tradeoff. For long words (which I guess will be scarce), do a naive search. Generate patterns for the shorter ones. Anyway, I didn't think through it completely, hence chose to comment. Perhaps it will help OP come up with something. –  Feb 02 '11 at 02:11
  • @Moron If there are only a few wildchars you may also reverse the strategy and search for all possible words that satisfy the pattern – Dr. belisarius Feb 02 '11 at 02:16
  • @belisarius: Yes, in fact backtracking in the trie is probably doing that implicitly, in a slightly more optimal way. –  Feb 02 '11 at 02:20
  • @Murilo As much as I love optimization, have you actually experienced performance issues using RegEx in this case? Basically you are asking how to implement a subset of regular expression matching, which is quite a bit of work and likely not significantly faster for the types of searches you're doing. – Justin Morgan Feb 02 '11 at 03:55
  • 1
    @Justin: The problem isn't with regex. The problem is applying the regex into all words of the set. The problem is not match the word but is to reduce the set of possible words for matching. – Murilo Vasconcelos Feb 02 '11 at 03:59
  • Anyway, I guess Murilo does not seem interested. Good luck. –  Feb 02 '11 at 18:33
  • @Moron: I'm interested, but I was thinking about what you said and see that for a word with 15 chars we'll get 2^15 new words (abcd => abcd, abc., ab.d, ab.., a.cd and so on...). – Murilo Vasconcelos Feb 02 '11 at 18:40
  • 1
    @Murilo: You don't have to do it for all words. I am guessing longer words are rarer, so you can choose to do words with say 8 letters or less and a simple search for the rest (you have to decide based on your dictionary). There might be a lot of repeats (depends on dictionary) and that might save a lot of memory. For instance .e.t will come from best, test, rest, lest etc. So even though all these words generate 2^4 pattern, you will have few total. You can even try generating patterns for prefixes and switch to backtracking near the 'tail'. The tradeoff depends on the data you have though. –  Feb 02 '11 at 18:45
  • Cool. I'll study more and take these comments into consideration. Maybe I'll have to mix some of these ideas with heuristics. Thank you. – Murilo Vasconcelos Feb 02 '11 at 18:51
  • Similar to http://stackoverflow.com/questions/1953080/good-algorithm-and-data-structure-for-looking-up-words-with-missing-letters -- that question is specialized to at most two wildcards, though, and adjacent ones. – Darius Bacon Feb 02 '11 at 20:07
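
A minimal C++ sketch of the pattern-generation idea discussed above (build_pattern_index and the unordered_set representation are illustrative, not from the comments). Every word of length L expands into 2^L patterns, so as the comments note this only pays off for short words, with a fallback scan for the rest:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Preprocessing: for every word, insert all 2^L variants that have '.' at
// any subset of positions. A query is then a single hash lookup.
std::unordered_set<std::string> build_pattern_index(
        const std::vector<std::string>& words) {
    std::unordered_set<std::string> index;
    for (const std::string& w : words) {
        const size_t n = w.size();
        for (size_t mask = 0; mask < (size_t{1} << n); ++mask) {
            std::string p = w;
            for (size_t i = 0; i < n; ++i)
                if (mask & (size_t{1} << i)) p[i] = '.';
            index.insert(p);  // duplicates across words collapse here
        }
    }
    return index;
}

// O(length(pattern)) expected time per query.
bool find(const std::unordered_set<std::string>& index,
          const std::string& pattern) {
    return index.count(pattern) != 0;
}
```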

5 Answers

2

You could build a suffix tree of the corpus of all the words in your set (see this link). Using this data structure, your complexity would include a one-time cost of O(n) to build the tree, where n is the sum of the lengths of all your words.

Once the tree is built, finding whether a string matches should take just O(n), where n is the length of the string.

Matthew Manela
  • 1
    I don't know if I fully understood but I think that find(myset, "...C") in that tree (with "ABAB" and "BABA") will result in a O(N*M) search, because I'll have to branch at the two root's subtrees searching for that pattern. – Murilo Vasconcelos Feb 02 '11 at 01:35
  • A suffix array would be even better space wise: http://en.wikipedia.org/wiki/Suffix_array – BrokenGlass Feb 02 '11 at 02:03
  • Please elaborate. How would this work? -1 till then, to counter the +1. –  Feb 02 '11 at 02:14
  • I misread the part about the wild cards. That would make this not O(n), but I do not think it would be O(N*M) either. If there are no wild cards it would be O(n), but for each wild card you could basically do a breadth-first traversal of the nodes at the wild card's level. – Matthew Manela Feb 02 '11 at 02:23
  • For the example I gave in my comment above, the search has to visit all of the tree's nodes, so its worst case is `O(N*M)`. – Murilo Vasconcelos Feb 02 '11 at 02:29
1

If the set is fixed, you could pre-calculate the frequency of each character c appearing at each position p (for as many p values as you consider worthwhile), then search through the array once, testing each element's characters at specific positions in an order such that you are most likely to exit early.
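
A minimal C++ sketch of this heuristic (FreqIndex and all names are illustrative): count how often each character occurs at each position, then test a pattern's literal positions rarest-first so mismatches are detected as early as possible:

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

struct FreqIndex {
    // freq[p][c] = number of words having character c at position p
    std::vector<std::array<int, 256>> freq;

    explicit FreqIndex(const std::vector<std::string>& words) {
        size_t max_len = 0;
        for (const std::string& w : words) max_len = std::max(max_len, w.size());
        freq.assign(max_len, {});
        for (const std::string& w : words)
            for (size_t p = 0; p < w.size(); ++p)
                ++freq[p][static_cast<unsigned char>(w[p])];
    }

    bool find(const std::vector<std::string>& words, const std::string& pat) const {
        if (pat.size() > freq.size()) return false;  // longer than every word
        // Visit the literal (non '.') positions in order of ascending frequency.
        std::vector<size_t> order;
        for (size_t p = 0; p < pat.size(); ++p)
            if (pat[p] != '.') order.push_back(p);
        std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
            return freq[a][static_cast<unsigned char>(pat[a])]
                 < freq[b][static_cast<unsigned char>(pat[b])];
        });
        for (const std::string& w : words) {
            if (w.size() != pat.size()) continue;
            bool ok = true;
            for (size_t p : order)
                if (w[p] != pat[p]) { ok = false; break; }  // likely early exit
            if (ok) return true;
        }
        return false;
    }
};
```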

Stephen Denne
  • Can you give me more details? I dont see how is the search. – Murilo Vasconcelos Feb 02 '11 at 01:50
  • The worst case search is still O(n*m). This simply rearranges the character comparisons to more often return false instead of true, meaning fewer characters per word are searched on average. – Stephen Denne Feb 02 '11 at 02:15
  • 1
    +1: This is a reasonable heuristic. Basically if you create a trie and are backtracking in the trie because of the wildcard, you can choose the next branch to take based on the character frequency. This can be generalized a bit to consider two/three etc characters at a time and might give better performance. See: [N-gram search](http://en.wikipedia.org/wiki/N-gram) –  Feb 02 '11 at 02:16
1

First, divide the corpus into sets per word length. Then your find algorithm can search over the appropriate set, since the input to find() always requires the match to have a specific length, and the algorithm can be designed to work well with all words of the same length.

Next (for each set), create a hash map from a hash of (character, position) to a list of matching words. It is quite OK to have a large number of hash collisions. You can use delta and run-length encoding to reduce the size of the lists of matching words.

To search, pick the appropriate hash map for the pattern's length, and for each non-'.' character, calculate the hash for that (character, position) pair, then AND together the lists of words to get a much-reduced list.

Brute force search through that much smaller list.
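
A minimal C++ sketch of this scheme, with two simplifications of my own: the lists are stored as uncompressed bitsets (no delta or run-length encoding), and the (character, position) key is exact rather than a lossy hash, so the final brute-force pass is not needed. One PositionIndex is built per word length, as described above:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct PositionIndex {
    std::vector<std::string> words;  // all words of one length
    // key: position * 256 + character -> bitset over word indices
    std::unordered_map<int, std::vector<uint64_t>> lists;

    explicit PositionIndex(std::vector<std::string> ws) : words(std::move(ws)) {
        const size_t nblocks = (words.size() + 63) / 64;
        for (size_t i = 0; i < words.size(); ++i)
            for (size_t p = 0; p < words[i].size(); ++p) {
                auto& bits = lists[int(p) * 256 + (unsigned char)words[i][p]];
                bits.resize(nblocks, 0);
                bits[i / 64] |= uint64_t{1} << (i % 64);
            }
    }

    // Caller picks the PositionIndex whose word length equals pat.size().
    bool find(const std::string& pat) const {
        const size_t nblocks = (words.size() + 63) / 64;
        std::vector<uint64_t> acc(nblocks, ~uint64_t{0});  // start with all words
        for (size_t p = 0; p < pat.size(); ++p) {
            if (pat[p] == '.') continue;  // wildcard: no filtering at this position
            auto it = lists.find(int(p) * 256 + (unsigned char)pat[p]);
            if (it == lists.end()) return false;  // no word has this char here
            for (size_t b = 0; b < nblocks; ++b) acc[b] &= it->second[b];
        }
        for (uint64_t block : acc)
            if (block) return true;  // some word survived every AND
        return false;
    }
};
```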

Stephen Denne
0

I had this same question, and I wasn't completely happy with most of the ideas/solutions I found on the internet. I think the "right" way to do this is to use a Directed Acyclic Word Graph. I didn't quite do that, but I added some additional logic to a Trie to get a similar effect.

See my isWord() implementation, analogous to your desired find() interface. It works by recursing down the Trie, branching on wildcards, and then collecting results back into a common set (see findNodes()).

getMatchingWords() is similar in spirit, except that it returns the set of matching words, instead of just a boolean as to whether or not the query matches anything.
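
The linked code isn't reproduced here, but a minimal C++ sketch of the recursion the answer describes might look like this (the node layout and a lowercase a–z alphabet are assumptions; isWord matches the interface mentioned above):

```cpp
#include <memory>
#include <string>

struct TrieNode {
    bool is_word = false;
    std::unique_ptr<TrieNode> child[26];  // lowercase a-z assumed
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->is_word = true;
}

bool isWord(const TrieNode* node, const std::string& pat, size_t i = 0) {
    if (!node) return false;
    if (i == pat.size()) return node->is_word;  // the whole pattern must be consumed
    if (pat[i] == '.') {                        // wildcard: branch into every child
        for (const auto& next : node->child)
            if (isWord(next.get(), pat, i + 1)) return true;
        return false;
    }
    return isWord(node->child[pat[i] - 'a'].get(), pat, i + 1);
}
```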

Jameson
0

If you are sure that the lengths of the words in your set are not large, you could probably create a table which holds the following:

list of words whose first character is 'a', list of words whose first character is 'b', ...

list of words whose second character is 'a', list of words whose second character is 'b', ...

and so on.

When you are searching for a word, you can look up the list of words whose first character is the same as the search string's first character. Within this refined list, look for the words whose second character matches the search string's second character, and so on. You can skip '.' characters whenever you encounter them.

I understand that building the table may take a large amount of space, but the query time will come down significantly.

For example, if you have myset = {"bar", "foo", "cya", "test"} and you are searching for 'f.o'

The moment you check the list of words starting with 'f', you eliminate the rest of the set. Just an idea... hope it helps.
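
A minimal C++ sketch of this table (CharPositionTable and the (position, character) keying are illustrative): the query refines using the shortest applicable list and then verifies only those candidates:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct CharPositionTable {
    std::vector<std::string> words;
    // key: position * 256 + character -> indices of words with that char there
    std::unordered_map<int, std::vector<int>> table;

    explicit CharPositionTable(std::vector<std::string> ws) : words(std::move(ws)) {
        for (int i = 0; i < (int)words.size(); ++i)
            for (size_t p = 0; p < words[i].size(); ++p)
                table[int(p) * 256 + (unsigned char)words[i][p]].push_back(i);
    }

    bool find(const std::string& pat) const {
        const std::vector<int>* best = nullptr;
        for (size_t p = 0; p < pat.size(); ++p) {
            if (pat[p] == '.') continue;  // wildcards filter nothing
            auto it = table.find(int(p) * 256 + (unsigned char)pat[p]);
            if (it == table.end()) return false;  // char never occurs at p
            if (!best || it->second.size() < best->size()) best = &it->second;
        }
        if (!best) {  // all-wildcard pattern: only lengths matter
            for (const std::string& w : words)
                if (w.size() == pat.size()) return true;
            return false;
        }
        for (int i : *best) {  // verify only the refined candidate list
            const std::string& w = words[i];
            if (w.size() != pat.size()) continue;
            bool ok = true;
            for (size_t p = 0; p < pat.size(); ++p)
                if (pat[p] != '.' && pat[p] != w[p]) { ok = false; break; }
            if (ok) return true;
        }
        return false;
    }
};
```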