8

I have a scenario where I would like to search using a wildcarded pattern ontop of a string that already has wildcards in it. In my words, i would say it is a 2 way pattern matching requirement.

The input and pattern string can have either/both of the following wildcards - ? representing a single character and % representing zero or more characters. Assume these are the only 2 wildcards allowed in input and the pattern strings.

For ex:

bool IsMatch(string input, string pattern) //Should return True if the input string matches the pattern, should return False otherwise.

IsMatch("XYZ%", "?Y%") // Should return True

IsMatch("YY?", "?Y%") // Should return True - Last character in input string expects a single character where as the pattern matches to zero or more characters after Y (which means it includes a single character match as well)

IsMatch("X123", "?Y%") // Should return False - Missing Y in the input string which the pattern expects

IsMatch("?Y%", "?Y%")// Should return True

IsMatch("%", "?Y%")// Should return True - The input string has a wildcard % representing zero or more characters and can also have any character(s). In a way, it acts as a pattern in itself representing anything of any size.

I'm able to find articles (ex: Regex) that only talk about performing a wildcarded pattern match on a non-wildcarded string. I'm looking for pointers/ideas on the algorithm as it is getting difficult for me to come up with an algorithm that can do this kind of match, as I start putting it down. Appreciate your inputs.

PKumar
  • 91
  • 5
  • How do you figure these match? The X and the Y match, the % (which is a wildcard usually used in databases but not in regexes) matches with the rest of the input (Z5%), but after that, your search pattern demands a HH?123 which is not found in the input. Or do you figure those are matched by the wildcard at the end of your input? That makes your question 'is there an input these regexes both can match?' which is not easy to answer. – SQB Nov 07 '13 at 18:16
  • @SQB `XY` match in both, `Z5` in the first is matched by `%` in second, `HH?123` in second is matched by `%` in first. Kind of a bidirectional matching. – Reinstate Monica -- notmaynard Nov 07 '13 at 18:19
  • 1
    @iamnotmaynard Yes, I was afraid of that. As I said, that makes the real question 'is there _any_ input these two regexes both can match?' – SQB Nov 07 '13 at 18:21
  • @SQB The input string is stored in the DB, which I fetch and compare with the search/pattern string. The expectation is that, in my app UI, if the user searches using this search/pattern string, my app(C#.net) should match this input string that was fetched from the DB and display this input string to the user. – PKumar Nov 07 '13 at 18:29
  • You're going to have to clearly define your pattern matching language used (or point to something online) and give some search _patterns_ and various _inputs_ for each -- and explain why it is or is not considered a match. Only then can anyone tell you if this can be done with regexps. – Phil Perry Nov 07 '13 at 19:58
  • So in your example `?` is for `HH` and not just for last `H`, right? How would you then write syntax for having first `H` mandatory and second `H` optional? – Ωmega Nov 07 '13 at 20:20
  • I edited my question with lot more details and also changed the examples as the initial one was either confusing or mis-leading. – PKumar Nov 08 '13 at 02:50
  • @SQB is correct. One way to answer the [real] question would be to transform one of the pattern languages into the other language and then lexically analyze both patterns. This is far from foolproof - but you are asking a really hard question to answer. – Sam Axe Nov 08 '13 at 02:58
  • http://stackoverflow.com/questions/3630203/does-an-algorithm-exist-which-can-determine-whether-one-regular-language-matches - Your problem is a version of inclusion problem. You can transform one of the "regexes" to it's equivalent format in the other, construct minimum DFA and compare them. There's no simpler way for all the general cases. – hawk Nov 08 '13 at 03:04
  • @Dan-o/hawk Let us assume both are of the same pattern languages (which is the case for my scenario); I had edited my examples. Sorry about the confusion. – PKumar Nov 08 '13 at 04:01

1 Answers1

2

As I wrote in my comment for the most general cases you'd have to create the minimal deterministic finite automaton of the two expressions and compare the two automatons. Having said that there may be a bruteforce/poorman's solution to your question.

Based on your examples it sounds like you're interested in seeing if one of input/pattern matches all the strings generated by the other.

IsMatch("XYZ%", "?Y%") // returns true because ?Y% matches a superset of strings matched by "XYZ%"
IsMatch("%", "?Y%") // returns true because "%" matches a superset of "?Y%"

You can check if input indeed matches a subset of strings generated by pattern as long as

  • you really can limit yourselves to % and ? operators as specified
  • your input/pattern strings are reasonably short - more specifically the occurences of % in either input or pattern are less than about 20 or so.

The basic idea is you generate a list of representative strings for input and match each one with pattern using your favorite regex engine. If all the representatives match - input matches a subset of pattern. This algorithm for IsSubset can be described as follows

let c = some character not in `pattern` (lexically speaking)
let searchString = replace all occurences of '?' in input with c
add searchString to setOfSearchStrings
foreach occurence of '%' in input
    foreach str in setOfSearchStrings
        replace str with two strings - {str with c in place of '%', str without the '%'}

foreach str in setOfSearchStrings
   if str doesn't "regex" match with pattern 
       return false

return true

for example if input is ?X%YZ% and the pattern doesn't contain the character A the list generated would be
AXYZ
AXYZA
AXAYZ
AXAYZA

It's easy to see that the number of strings in this list is 2^n where n is the number of '%' in input.

Also it's easy to swap the order of arguments and figure out the relationship the other way round. So in effect your

IsMatch(input,pattern) = IsSubset(input,pattern) || IsSubset(pattern,input)
hawk
  • 1,827
  • 2
  • 14
  • 28
  • Hi Hawk, thank you very much for sharing your ideas and providing a detailed explanation. This is a very good idea and for sure helps me to think about the solution in this direction. – PKumar Nov 08 '13 at 08:33