2

I am writing a c# class library that will allow me to scan an SQL server query and extract objects from the query into their correct groupings, for example:

SELECT * FROM "My Server"."Northwind"."dbo"."Product Sales for 1997" Group By CategoryID

This regex will will match the above string and group "My Server", "Northwind", "dbo" and "Product Sales for 1997" into four groups, which is what I want.

(?i)\bFROM\b\s+[\["]([^\]"]*)[\]"].{1}[\["]([^\]"]*)[\]"].{1}[\["]([^\]"]*)[\]"].{1}[\["]([^\]"]*)[\]"].{1}

What I am looking for is a single regex expression that can capture the server name, database name, schema name and object name for any of the following combinations (this is not an exhaustive list by any means):

SELECT * FROM dbo."Product Sales for 1997" // should return groups 2 & 3
SELECT * FROM Northwind."My Schema"."My view or table function" // should return  groups 1, 2 & 3
SELECT * FROM "My view or table function" // should return group 3
SELECT * FROM dbo."My View 1" AS V1 JOIN "My View 1" AS V2 ON V1.ID = V2 // should return groups 2 & 3

In other words I want to capture the various components into the following groups:

Group 0 --> Server Name
Group 1 --> Database Name
Group 2 --> Schema
Group 3 --> Object Name

I am trying to avoid creating multiple regex expressions to handle every possible combination to avoid my class library becoming too large and complex, but as a regex n00b it's proving a bit difficult.

Intrepid
  • 2,781
  • 2
  • 29
  • 54
  • 2
    Regular expressions aren't suited to this. You need a parser. e.g. [`Microsoft.Data.Schema.ScriptDom`](http://stackoverflow.com/a/5793088/73226) – Martin Smith Dec 04 '11 at 13:37
  • Thanks for the quick response, however writing a parser is out of the question due to time constraints and the complexity of writing one. Are there any pre-existing T-SQL parsers that I can use? I did manage to write regex expressions to extract the components out of an EXEC statement, but only works if there are no spaces in the object names. I quickly realised that to cover all combinations I would end up with too many regex expressions that I really wanted. – Intrepid Dec 04 '11 at 13:42
  • `Microsoft.Data.Schema.ScriptDom` as above. If that isn't available to you see the comments of the OP [in this question](http://stackoverflow.com/q/7690380/73226) for another alternative. – Martin Smith Dec 04 '11 at 13:46
  • A parser would be a good idea, but if you need a quick regex fix, from the examples it looks like when you have xxx.xxx.xxx.xxx it's groups 0,1,2,3; with xxx.xxx.xxx it's 1,2,3; with xxx.xx it's 2,3; with xxx it's 3. Then you could do `\bFROM\b\s+(?:(?:(?:xx)?xx)?xx)?xx` as your regex, with `xx` being the regex you have above, `[\["]([^\]"]*)[\]"].{1}`. (Note also the `.{1}` is redundant; `.` would do). – mathematical.coffee Dec 13 '11 at 00:11
  • After a bit of thought I have decided to have a go at implementing a parser using [`Microsoft.Data.Schema.ScriptDom`](http://blogs.msdn.com/b/gertd/archive/2008/08/21/getting-to-the-crown-jewels.aspx) rather than using regex because it soon transpired that I needed to parse other SQL statements. – Intrepid Dec 19 '11 at 09:09

2 Answers2

0

The best one can do with regex is to parse it into tokens and then one will have to determine what the actuals (server database etc) of the groups are. Here is a regex to to get your data examples into such tokens. Note I wasn't aware that sql server had quotes, but your example asked for them so I look for single and double quotes escaped as \x22 and \x27 respectively using an If conditional (see my blog article Regular Expressions and the If Conditional). The tokens are placed into match captures where they are extracted.

string data =
@"SELECT * FROM dbo.""Product Sales for 1997"" // should return groups 2 & 3 
SELECT * FROM Northwind.""My Schema"".""My view or table function"" // should return  groups 1, 2 & 3 
SELECT * FROM ""My view or table function"" // should return group 3 
SELECT * FROM dbo.""My View 1"" AS V1 JOIN ""My View 1"" AS V2 ON V1.ID = V2 // should return groups 2 & 3 ";

string pattern = 
@"
(?:FROM\s+)                 # Work from a from only
(
  (?([\x27\x22])            # If a single or double quote is found      
     (?:[\x27\x22])
       (?<Tokens>[\w\s]+)   # process quoted text
     (?:[\x27\x22]\.?)
   |                        # else
     (?!\s+AS|\s+WHERE)     # if AS or Where is found stop the match we are done
     (?:\.?)
     (?<Tokens>\w+)         # Process non quoted token.
     (?:\.?)
   )
   (?![\n\r/])              # Stop on CR/LF or a comment. 
){0,4}                      # Only do this 1 to 4 times, for it can't be more (parser hint to stop)
";

Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace) // Ignore is to allow commenting of the pattern only (not data processing)
    .OfType<Match>()
    .Select(mt => mt.Groups["Tokens"]
                    .Captures.OfType<Capture>()
                    .Select(cp => cp.Value))
    .ToList() // To do the foreach below
    .ForEach(tokens => Console.WriteLine(string.Join(" | ", tokens)));

/* Output
dbo | Product Sales for 1997
Northwind | My Schema | My view or table function
My view or table function
dbo | My View 1
*/
ΩmegaMan
  • 29,542
  • 12
  • 100
  • 122
0

To parse arbitrary SQL query, you're far better off with a SQL parser. Trying to parse arbitrary SQL with regexes will amount to writing your own parser.

With the help of a full SQL parser, that's pretty easy to achieve what you need:

SELECT * FROM Northwind."My Schema"."My view or table function";

output will be something like this:

select clause:
Columns
Fullname:*
Prefix: Column:*    alias:

from clause:
   Northwind."My Schema"."My view or table function"

database: Northwind
schema:   "My Schema"
object:   "My view or table function"
object alias:

You can try this demo yourself to test more complicated SQL queries.

James Wang
  • 453
  • 4
  • 5