With the limits you have imposed on your SQL WHERE predicate, it is possible to create a regular expression which extracts the comparison expressions while accounting for constructs like (DATETIME('NOW'))
.
The regular expression i will show and explain requires the WHERE predicate to be syntactically correct. If the WHERE predicate has syntax errors, the regular expression might either not match or produce rubbish results.
The regular expression in its full glory (white-spaces added to enhance readability!):
\w[\w\d]* \s*[<>=]{1,2}\s* ( \?|\w[\w\d]*|(\w[\w\d]*)*((?<PR>\()|(?<-PR>\))|[^()])+ )
While it is certainly not a very long regular expression, it can be nevertheless hard to read and understand. Thus, let's deconstruct this regular expression and explain its several parts. To do so, we will first take a look at what we actually want to extract from the WHERE predicate.
Every expression we want to extract from the WHERE predicate follows the same basic pattern:
SomeIdentifierWithoutParantheses =|<=|>=|<> SomeOtherThingWithOrWithoutParentheses
This (lofty description of a) pattern is sufficient to get an understanding of what the regular expression needs to match to extract the desired parts from the WHERE predicate.
The first part \w[\w\d]*
in our regular expression matches SomeIdentifierWithoutParantheses. This can be any identifier that starts with an alphanumeric character, followed by other alphanumeric characters and/or numerical digits. Examples for such an identifier would be ProjectNumber and My1Ident23.
The second part in the regular expression, \s*[<>=]{1,2}\s*
, matches the comparison operands =
, <=
, >=
and <>
, inclusive any white-spaces before and following after the comparison operator. (Well, it would also match nonsense like =<
, but -as said in the beginning- we assume syntactically correct SQL.)
The third part of the regular expression matches the second operand following the comparision operator, and this looks admittedly somewhat unwieldy: ( \?|\w[\w\d]*|(\w[\w\d]*)*((?<PR>\()|(?<-PR>\))|[^()])+ )
. Let's deconstruct this part of the regular expression somewhat further. As you might perhaps already noticed, the whole thing is an alternation with three alternate options, which will be explained in the following.
\?
obviously matches a single question mark (as in "ProjectNumber=?"). \w[\w\d]*
matches an identifier in the same way as the first operand has been matched.
(\w[\w\d]*)*((?<PR>\()|(?<-PR>\))|[^()])+
contains a specialty of the RegEx engine in .NET: balanced groups. Using such a construct with balanced groups allows the regular expression to match operands which contain (nested) groups of parentheses (
-)
like DATETIME('NOW') or (DATETIME('NOW')).
For another question here on StackOverflow, Martin Büttner has given a great explanation about balanced groups (the topic of the question is "What are regular expression Balancing Groups?") and i would like to point anyone who does not know about balanced groups to his answer (click here to navigate to Martin's answer). Another good explanation can be found on CodeProject.
You will notice that there is nothing that deals with the Boolean operators such as AND or OR. This is not necessary, since these Boolean operators are not part of the pattern which is being matched by the regular expression. (Remember the lofty description of the pattern above?)
How would such an regular expression being used in C# to extract the desired parts from the WHERE predicate?
First, please notice that i will continue using white-spaces in the regular expression for increased readability. This requires RegEx initialization using RegexOptions.IgnorePatternWhitespace or the "(?x)" inline option. In the code below, i use the former.
To extract all parts from the WHERE predicate, the RegEx.Matches method will be used, which returns a collection of Match objects. Each Match object represents an extracted part.
Regex re = new Regex(
@"\w[\w\d]* \s*[<>=]{1,2}\s* ( \?|\w[\w\d]*|(\w[\w\d]*)*((?<PR>\()|(?<-PR>\))|[^()])+ )",
RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled
);
string wherePredicate =
"(ProjectNumber=? AND severity=5) OR DueDate < (DATETIME('NOW'))";
// or use any other WHERE predicate string here...
MatchCollection mc = re.Matches(wherePredicate);
if (mc.Count == 0)
Console.WriteLine("No matches found.");
else
foreach (Match m in mc)
Console.WriteLine("\"{0}\"", m.Value);
You can experiment with the regular expression and different input strings online with the help of the Regex Storm .NET Regex tester.