With your very reduced pattern language, the pastebin link in your question and jpmc26's comments are pretty much all the way there: the main question is, whether the literal left and right end of your input strings match. If they do, and both contain at least one *
, the strings match (because you can always match the other strings intermediate literal text with that star). There is one special case: if only one of them is empty (after removing pre- and suffix), they can still match if the other consists entirely of *
s.
Of course, when checking whether the ends of the string match, you need to take into account the single-character wildcard ?
and character classes, too. The single-character wildcard is easy: it cannot fail, because it will always match whatever the other character is. If it's a character class, and the other is just a character, you need to check whether the character is in the class. If they are both classes, you need to check for an intersection of the classes (which is a simple set intersection).
Here is all of that in JavaScript (check out the code comments to see how the algorithm I outlined above maps to the code):
var trueInput = [
{ left: 'prefix*', right: 'prefix:extended*' },
{ left: '*suffix', right: '*:extended:suffix' },
{ left: 'left*right', right: 'left*middle*right' },
{ left: 'a*b*c', right: 'a*b*d*b*c' },
{ left: 'hello*', right: '*ok' },
{ left: '*', right: '*'},
{ left: '*', right: '**'},
{ left: '*', right: ''},
{ left: '', right: ''},
{ left: 'abc', right: 'a*c'},
{ left: 'a*c', right: 'a*c'},
{ left: 'a[bc]d', right: 'acd'},
{ left: 'a[bc]d', right: 'a[ce]d'},
{ left: 'a?d', right: 'acd'},
{ left: 'a[bc]d*wyz', right: 'abd*w[xy]z'},
];
var falseInput = [
{ left: 'prefix*', right: 'wrong:prefix:*' },
{ left: '*suffix', right: '*suffix:wrong' },
{ left: 'left*right', right: 'right*middle*left' },
{ left: 'abc', right: 'abcde'},
{ left: 'abcde', right: 'abc'},
{ left: 'a[bc]d', right: 'aed'},
{ left: 'a[bc]d', right: 'a[fe]d'},
{ left: 'a?e', right: 'acd'},
{ left: 'a[bc]d*wyz', right: 'abc*w[ab]z'},
];
// Expects either a single-character string (for literal strings
// and single-character wildcards) or an array (for character
// classes).
var characterIntersect = function(a,b) {
// If one is a wildcard, there is an intersection.
if (a === '?' || b === '?')
return true;
// If both are characters, they must be the same.
if (typeof a === 'string' && typeof b === 'string')
return a === b;
// If one is a character class, we check that the other
// is contained in the class.
if (a instanceof Array && typeof b === 'string')
return (a.indexOf(b) > -1);
if (b instanceof Array && typeof a === 'string')
return (b.indexOf(a) > -1);
// Now both have to be arrays, so we need to check whether
// they intersect.
return a.filter(function(character) {
return (b.indexOf(character) > -1);
}).length > 0;
};
var patternIntersect = function(a,b) {
// Turn the strings into character arrays because they are
// easier to deal with.
a = a.split("");
b = b.split("");
// Check the beginnings of the string (up until the first *
// in either of them).
while (a.length && b.length && a[0] !== '*' && b[0] !== '*')
{
// Remove the first character from each. If it's a [,
// extract an array of all characters in the class.
aChar = a.shift();
if (aChar == '[')
{
aChar = a.splice(0, a.indexOf(']'));
a.shift(); // remove the ]
}
bChar = b.shift();
if (bChar == '[')
{
bChar = b.splice(0, b.indexOf(']'));
b.shift(); // remove the ]
}
// Check if the two characters or classes overlap.
if (!characterIntersect(aChar, bChar))
return false;
}
// Same thing, but for the end of the string.
while (a.length && b.length && a[a.length-1] !== '*' && b[b.length-1] !== '*')
{
aChar = a.pop();
if (aChar == ']')
{
aChar = a.splice(a.indexOf('[')+1, Number.MAX_VALUE);
a.pop(); // remove the [
}
bChar = b.pop();
if (bChar == ']')
{
bChar = b.splice(b.indexOf('[')+1, Number.MAX_VALUE);
b.pop(); // remove the [
}
if (!characterIntersect(aChar, bChar))
return false;
}
// If one string is empty, the other has to be empty, too, or
// consist only of stars.
if (!a.length && /[^*]/.test(b.join('')) ||
!b.length && /[^*]/.test(b.join('')))
return false;
// The only case not covered above is that both strings contain
// a * in which case they certainly overlap.
return true;
};
console.log('Should be all true:');
console.log(trueInput.map(function(pair) {
return patternIntersect(pair.left, pair.right);
}));
console.log('Should be all false:');
console.log(falseInput.map(function(pair) {
return patternIntersect(pair.left, pair.right);
}));
It's not the neatest implementation, but it works and is (hopefully) still quite readable. There is a fair bit of code duplication with checking the beginning and the end (which could be alleviated with a simple reverse
after checking the beginning - but I figured that would just obscure things). And there are probably tons of other bits that could be greatly improved, but I think the logic is all in place.
A few more remarks: the implementation assumes that the patterns are well-formatted (no unmatched opening or closing brackets). Also, I took the array intersection code from this answer because it's compact - you could certainly improve on the efficiency of that if necessary.
Regardless of those implementation details, I think I can answer your complexity question, too: the outer loop goes over both strings at the same time, a character at a time. So that's linear complexity. Everything inside the loop can be done in constant time, except the character class tests. If one character is a character class and the other isn't, you need linear time (with the size of the class being the parameter) to check whether the character is in the class. But this doesn't make it quadratic, because each character in the class means one less iteration of the outer loop. So that's still linear. The most costly thing is hence the intersection of two character classes. This might be more complex that linear time, but the worst it could get is O(N log N)
: after all, you could just sort both character classes, and then find an intersection in linear time. I think you might even be able to get overall linear time complexity, by hashing the characters in the character class to their Unicode code point (.charCodeAt(0)
in JS) or some other number - and finding an intersection in a hashed set is possible in linear time. So, if you really want to, I think you should be able to get down to O(N)
.
And what is N
? The upper limit is sum of the length of both patterns, but in most cases it will actually be less (depending on the length of prefixes and suffixes of both patterns).
Please point me to any edge-cases my algorithm is missing. I'm also happy about suggested improvements, if they improve or at least don't reduce the clarity of the code.
Here is a live demo on JSBin (thanks to chakrit for pasting it there).
EDIT: As Daniel pointed out, there is a conceptual edge-case that my algorithm misses out on. If (before or after elimination of the beginning and end) one string contains no *
and the other does, there are cases, where the two still clash. Unfortunately, I don't have the time right now to adjust my code snippet to accommodate that problem, but I can outline how to resolve it.
After eliminating both ends of the strings, if both strings are either empty or both contain at least *
, they will always match (go through the possible *
-distributions after complete elimination to see this). The only case that's not trivial is if one string still contains *
, but the other doesn't (be it empty or not). What we now need to do is walk both strings again from left to right. Let me call the string that contains *
A and the one that doesn't contain *
B.
We walk A from left to right, skipping all *
(paying attention only to ?
, character classes and literal characters). For each of the relevant tokens, we check from left to right, if it can be matched in B (stopping at the first occurrence) and advance our B-cursor to that position. If we ever find a token in A that cannot be found in B any more, they do not match. If we manage to find a match for each token in A, they do match. This way, we still use linear time, because there is no backtracking involved. Here are two examples. These two should match:
A: *a*[bc]*?*d* --- B: db?bdfdc
^ ^
A: *a*[bc]*?*d* --- B: db?bdfdc
^ ^
A: *a*[bc]*?*d* --- B: db?bdfdc
^ ^
A: *a*[bc]*?*d* --- B: db?bdfdc
^ ^
These two should not match:
A: *a*[bc]*?*d* --- B: dbabdfc
^ ^
A: *a*[bc]*?*d* --- B: dbabdfc
^ ^
A: *a*[bc]*?*d* --- B: dbabdfc
^ ^
A: *a*[bc]*?*d* --- B: dbabdfc
!
It fails, because the ?
cannot possibly match before the second d
, after which there is no further d
in B to accommodate for the last d
in A.
This would probably be easy to add to my current implementation, if I had taken the time to properly parse the string into token objects. But now, I'd have to go through the trouble of parsing those character classes again. I hope this written outline of the addition is sufficient help.
PS: Of course, my implementation does also not account for escaping metacharacters, and might choke on *
inside character classes.