2

How do I parse a limited SQL where-clause using regular expression?

The where-clause is restricted in its format. It will not contain sub-queries. It is limited to just "AND", "OR" and "()".

So if I am given a where-clause, I want to extract the parts from it.

3 examples of where-clause:

  1. ProjectNumber=?
  2. ProjectNumber =? AND severity = 5
  3. (ProjectNumber=? AND severity=5) OR DueDate < (DATETIME('NOW'))

For example 1, I want "ProjectNumber=?"

For example 2, I want "ProjectNumber =?", "severity = 5"

For example 3, I want "ProjectNumber=?", "severity=5", "DueDate < (DATETIME('NOW'))"

The "?" means that the value is parameterised.

I understand that regular expression is not powerful enough to parse full SQL where-clauses.

I have found this "(?<=^|\A|(AND|OR))(?:[^']|'(?:[^']|'{2})+')*?(?=(AND|OR)|$|\Z)" but it doesn't work for 3.

The programming language is C# and database is SQLite.

p.s. I am very basic experience with Regex.

Thanks.

p.s. Below is my current C# code:

string query = @"(ProjectNumber=? AND severity=5) OR DueDate < (DATETIME('NOW'))";
string pattern = @"(?<=^|\A|(AND|OR))(?:[^']|'(?:[^']|'{2})+')*?(?=(AND|OR)|$|\Z)";
MatchCollection matches = Regex.Matches(query, pattern);
foreach (Match match in matches) 
    Console.WriteLine(match.ToString());
    // currently Console.WriteLine() gives the following:
    // (ProjectNumber=?
    // severity=5)
    //  DueDate < (DATETIME('NOW'))
henrykodev
  • 2,964
  • 3
  • 27
  • 39
  • Must you do it in one Regex? You could split it into two (or even three), separated by the `||` or and have code that's a lot easier to read and maintain. – Hambone Jul 31 '14 at 02:28
  • No, it doesn't have to be one Regex, multiple regex code would be good too. – henrykodev Jul 31 '14 at 02:40
  • 1
    You failed to explain as **which** kind of data structure you want to extract the parts. As a binary expression tree containing strings (each node having a single string representing a token such as an identifier or operator)? –  Jul 31 '14 at 02:47
  • You have your where clause already as simple string... please give an example input WHERE clause and the resulting data. Provide the resulting data in C# syntax - exactly as it would be required by your software. Right now it is totally unclear **what** you want to have as result. –  Jul 31 '14 at 02:54
  • As I said in my post, one example of where-clause is: `(ProjectNumber=? AND severity=5) OR DueDate < (DATETIME('NOW'))`. From this, I want 3 strings: `"ProjectNumber=?"`, `"severity=5"`, `"DueDate < (DATETIME('NOW'))"`. Thanks. – henrykodev Jul 31 '14 at 03:00
  • Ahh, okay. Sorry that i misunderstood your question. I didn't expect you just wanted the comparisons and discard of any parantheses/grouping and boolean operators. It's quite unusual but now i understand. Sorry again for not getting your problem before... –  Jul 31 '14 at 03:03
  • Just some clarification necessary: Would a WHERE clause like `ProjectNumber=? AND severity=(5+4)` be valid for your case (note the `(5+4)` which is part of the comparison expression)? –  Jul 31 '14 at 03:11
  • Good question... I'd say no for now. – henrykodev Jul 31 '14 at 03:13

1 Answers1

8

With the limits you have imposed on your SQL WHERE predicate, it is possible to create a regular expression which extracts the comparison expressions while accounting for constructs like (DATETIME('NOW')).

The regular expression i will show and explain requires the WHERE predicate to be syntactically correct. If the WHERE predicate has syntax errors, the regular expression might either not match or produce rubbish results.

The regular expression in its full glory (white-spaces added to enhance readability!):

\w[\w\d]* \s*[<>=]{1,2}\s* ( \?|\w[\w\d]*|(\w[\w\d]*)*((?<PR>\()|(?<-PR>\))|[^()])+ )

While it is certainly not a very long regular expression, it can be nevertheless hard to read and understand. Thus, let's deconstruct this regular expression and explain its several parts. To do so, we will first take a look at what we actually want to extract from the WHERE predicate.

Every expression we want to extract from the WHERE predicate follows the same basic pattern:

SomeIdentifierWithoutParantheses =|<=|>=|<> SomeOtherThingWithOrWithoutParentheses

This (lofty description of a) pattern is sufficient to get an understanding of what the regular expression needs to match to extract the desired parts from the WHERE predicate.

The first part \w[\w\d]* in our regular expression matches SomeIdentifierWithoutParantheses. This can be any identifier that starts with an alphanumeric character, followed by other alphanumeric characters and/or numerical digits. Examples for such an identifier would be ProjectNumber and My1Ident23.

The second part in the regular expression, \s*[<>=]{1,2}\s*, matches the comparison operands =, <=, >= and <>, inclusive any white-spaces before and following after the comparison operator. (Well, it would also match nonsense like =<, but -as said in the beginning- we assume syntactically correct SQL.)

The third part of the regular expression matches the second operand following the comparision operator, and this looks admittedly somewhat unwieldy: ( \?|\w[\w\d]*|(\w[\w\d]*)*((?<PR>\()|(?<-PR>\))|[^()])+ ). Let's deconstruct this part of the regular expression somewhat further. As you might perhaps already noticed, the whole thing is an alternation with three alternate options, which will be explained in the following.

\? obviously matches a single question mark (as in "ProjectNumber=?"). \w[\w\d]* matches an identifier in the same way as the first operand has been matched.

(\w[\w\d]*)*((?<PR>\()|(?<-PR>\))|[^()])+ contains a specialty of the RegEx engine in .NET: balanced groups. Using such a construct with balanced groups allows the regular expression to match operands which contain (nested) groups of parentheses (-) like DATETIME('NOW') or (DATETIME('NOW')).

For another question here on StackOverflow, Martin Büttner has given a great explanation about balanced groups (the topic of the question is "What are regular expression Balancing Groups?") and i would like to point anyone who does not know about balanced groups to his answer (click here to navigate to Martin's answer). Another good explanation can be found on CodeProject.

You will notice that there is nothing that deals with the Boolean operators such as AND or OR. This is not necessary, since these Boolean operators are not part of the pattern which is being matched by the regular expression. (Remember the lofty description of the pattern above?)


How would such an regular expression being used in C# to extract the desired parts from the WHERE predicate?

First, please notice that i will continue using white-spaces in the regular expression for increased readability. This requires RegEx initialization using RegexOptions.IgnorePatternWhitespace or the "(?x)" inline option. In the code below, i use the former.

To extract all parts from the WHERE predicate, the RegEx.Matches method will be used, which returns a collection of Match objects. Each Match object represents an extracted part.

Regex re = new Regex(
    @"\w[\w\d]* \s*[<>=]{1,2}\s* ( \?|\w[\w\d]*|(\w[\w\d]*)*((?<PR>\()|(?<-PR>\))|[^()])+ )",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled
);


string wherePredicate =
    "(ProjectNumber=? AND severity=5) OR DueDate < (DATETIME('NOW'))";
    // or use any other WHERE predicate string here...

MatchCollection mc = re.Matches(wherePredicate);

if (mc.Count == 0)
    Console.WriteLine("No matches found.");
else
    foreach (Match m in mc)
        Console.WriteLine("\"{0}\"", m.Value);    


You can experiment with the regular expression and different input strings online with the help of the Regex Storm .NET Regex tester.

Community
  • 1
  • 1
  • Thank you @elgonzo for the great answer - very clear and detailed, and works - can't ask for a better answer that!!! – henrykodev Jul 31 '14 at 05:11
  • Glad that i could help :) I would have loved showing a graphical representation of the regex as rendered by the www.debuggex.com regex tester, but this service unfortunately does not (yet) support/understand balanced groups. Bummer... :( –  Jul 31 '14 at 05:13