
I am looking for a solution to match a single string against a set of wildcard strings. For example

    >>> match("ab", ["a*", "b*", "*", "c", "*b"])
    ["a*", "*", "*b"]

The order of the output is of no importance.

I will have on the order of 10^4 wildcard strings to match against, and I will make around 10^9 match calls. This means I will probably have to rewrite my code to prepare the patterns once and reuse the matcher for every line:

    >>> matcher = prepare(["a*", "b*", "*", "c", "*b"])
    >>> matcher.match("ab")
    ["a*", "*", "*b"]

I've started writing a trie implementation in Python that handles wildcards, and I just need to get the corner cases right. Still, I am curious to hear: how would you solve this? Are there any Python libraries out there that would let me solve this faster?
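For what it's worth, here is a minimal sketch of the kind of wildcard trie I have in mind, assuming `*` is the only wildcard and that it matches any run of characters, including the empty one (the `PatternTrie` name and its interface are made up for illustration):

    class PatternTrie:
        """Trie over pattern characters; '*' edges may consume any substring."""

        def __init__(self, patterns):
            self.root = {}
            for pattern in patterns:
                node = self.root
                for char in pattern:
                    node = node.setdefault(char, {})
                # None marks a terminal node; store the original pattern there.
                node.setdefault(None, []).append(pattern)

        def match(self, text):
            matches = set()
            stack = [(self.root, 0)]  # (trie node, position in text)
            while stack:
                node, i = stack.pop()
                if None in node and i == len(text):
                    matches.update(node[None])
                if '*' in node:
                    # A '*' edge may consume zero or more characters.
                    for j in range(i, len(text) + 1):
                        stack.append((node['*'], j))
                if i < len(text) and text[i] in node:
                    stack.append((node[text[i]], i + 1))
            return list(matches)

Used like this:

    >>> matcher = PatternTrie(["a*", "b*", "*", "c", "*b"])
    >>> sorted(matcher.match("ab"))
    ['*', '*b', 'a*']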

Some insights so far:

  • Named (Python `re`) regular expression groups will not help me here, since they'll only return one match (see the snippet after this list).
  • pyparsing seems like an awesome library, but is sparsely documented and does not, as I see it, support matching multiple patterns.
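To illustrate the first point, here is a quick demonstration (hand-translating the example patterns to regex syntax):

    >>> import re
    >>> combined = re.compile(r"(?P<p0>a.*)|(?P<p1>b.*)|(?P<p2>.*)")
    >>> combined.match("ab").lastgroup  # only one branch is reported, not all three
    'p0'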
Ztyx
  • do you mean there are `10**5` strings and `10**4` patterns and you need to return a list of matching patterns for each individual string or is it enough to return a single matching (if any) pattern for each string? – jfs Oct 15 '12 at 22:34
  • A few questions. How long are the strings? Can the wildcards be literally any unicode character, or is it purely `string.letters`? – kreativitea Oct 15 '12 at 22:42
  • Have you considered using regular expressions? – Joel Cornett Oct 15 '12 at 22:43
  • @J.F. Sebastian: I actually want to count the number of occurrences of each pattern in a huge log file. – Ztyx Oct 16 '12 at 04:55
  • @kreativitea: Looks like the longest pattern is 980 characters. Not sure about the longest needle string, but maybe 2000... – Ztyx Oct 16 '12 at 04:59

2 Answers


You could use the FilteredRE2 class from the re2 library, with help from an Aho-Corasick algorithm implementation (or similar). From the re2 docs:

Required substrings. Suppose you have an efficient way to check which of a list of strings appear as substrings in a large text (for example, maybe you implemented the Aho-Corasick algorithm), but now your users want to be able to do regular expression searches efficiently too. Regular expressions often have large literal strings in them; if those could be identified, they could be fed into the string searcher, and then the results of the string searcher could be used to filter the set of regular expression searches that are necessary. The FilteredRE2 class implements this analysis. Given a list of regular expressions, it walks the regular expressions to compute a boolean expression involving literal strings and then returns the list of strings. For example, FilteredRE2 converts (hello|hi)world[a-z]+foo into the boolean expression “(helloworld OR hiworld) AND foo” and returns those three strings. Given multiple regular expressions, FilteredRE2 converts each into a boolean expression and returns all the strings involved. Then, after being told which of the strings are present, FilteredRE2 can evaluate each expression to identify the set of regular expressions that could possibly be present. This filtering can reduce the number of actual regular expression searches significantly.

The feasibility of these analyses depends crucially on the simplicity of their input. The first uses the DFA form, while the second uses the parsed regular expression (Regexp*). These kind of analyses would be more complicated (maybe even impossible) if RE2 allowed non-regular features in its regular expressions.
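For a pure-Python take on the same idea, here is a rough sketch using the third-party pyahocorasick package (`pip install pyahocorasick`): the literal pieces between the `*`s play the role of re2's required substrings, and `fnmatch` performs the final exact check. The `prepare`/`match` names mirror the question; everything else is an assumption, and it presumes the patterns use only `*` as a wildcard:

    import ahocorasick
    from fnmatch import fnmatchcase

    def prepare(patterns):
        automaton = ahocorasick.Automaton()
        literal_to_patterns = {}
        unconditional = []  # patterns such as "*" with no literal piece at all
        for pattern in patterns:
            pieces = [piece for piece in pattern.split('*') if piece]
            if not pieces:
                unconditional.append(pattern)
                continue
            # If a pattern's longest literal is absent from the text,
            # the pattern cannot possibly match.
            literal = max(pieces, key=len)
            literal_to_patterns.setdefault(literal, []).append(pattern)
        for literal, pats in literal_to_patterns.items():
            automaton.add_word(literal, (literal, pats))
        automaton.make_automaton()
        return automaton, unconditional

    def match(text, automaton, unconditional):
        candidates = list(unconditional)
        seen = set()
        for _, (literal, pats) in automaton.iter(text):
            if literal not in seen:
                seen.add(literal)
                candidates.extend(pats)
        # The automaton only yields *possible* matches; verify each candidate.
        return [p for p in candidates if fnmatchcase(text, p)]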

jfs

Seems like the Aho-Corasick algorithm would work. esmre seems to do what I'm looking for. I got this information from this question.
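For the record, a tiny sketch of how esmre could be wired up here, assuming its documented `Index.enter(regex, obj)` / `Index.query(text)` interface; the wildcard strings are converted to regexes with `fnmatch.translate` first:

    import fnmatch
    import esmre

    patterns = ["a*", "b*", "*", "c", "*b"]
    index = esmre.Index()
    for pattern in patterns:
        # \A anchors the translated regex so only whole-string matches count.
        index.enter(r"\A" + fnmatch.translate(pattern), pattern)

    # query() returns the objects attached to the regexes that match "ab";
    # a final fnmatchcase() check keeps the glob semantics exact.
    print([p for p in index.query("ab") if fnmatch.fnmatchcase("ab", p)])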

Ztyx