17

I was writing some Unit tests last week for a piece of code that generated some SQL statements.

I was trying to figure out a regex to match SELECT, INSERT and UPDATE syntax so I could verify that my methods were generating valid SQL, and after 3-4 hours of searching and messing around with various regex editors I gave up.

I managed to get partial matches but because a section in quotes can contain any characters it quickly expands to match the whole statement.

Any help would be appreciated, I'm not very good with regular expressions but I'd like to learn more about them.

By the way it's C# RegEx that I'm after.

Clarification

I don't want to need access to a database as this is part of a Unit test and I don't wan't to have to maintain a database to test my code. which may live longer than the project.

Nimantha
  • 6,405
  • 6
  • 28
  • 69
Omar Kooheji
  • 54,530
  • 68
  • 182
  • 238
  • Please not that regular expressions (classic and newer ones) have limitations when used to parse general syntax. For more details see https://stackoverflow.com/q/8577060/6607497 – U. Windl Jun 17 '21 at 05:59

13 Answers13

39

Regular expressions can match languages only a finite state automaton can parse, which is very limited, whereas SQL is a syntax. It can be demonstrated you can't validate SQL with a regex. So, you can stop trying.

Pablo Marambio
  • 1,562
  • 1
  • 15
  • 29
  • Agreed. Some other approach is needed here other than regexp. There's no way to match legal syntax for a select statement: http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_10002.htm#SQLRF01702 – David Aldridge Sep 26 '08 at 16:24
  • Agreed. Actually, you need a SQL parser. Here is an article shows how to do vendor-specific offline SQL syntax check with the help of general sql parser: http://www.dpriver.com/blog/list-of-demos-illustrate-how-to-use-general-sql-parser/vendor-specific-offline-sql-syntax-check/ – James Wang Jan 08 '12 at 12:15
  • 3
    Any context free grammar can be parsed by a modern regular expressions. Regular expressions no longer are restricted to parsing regular languages. If this were true, the language `{ a^nb^n }` would not be parseable, however this can be parsed by the regular expression `/^(a(\1)?b)$/`. See https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html for more info. – Jacklynn Jul 18 '16 at 22:48
  • @Jack interesting. Is this a "standard" feature? – Pablo Marambio Oct 18 '16 at 17:56
  • @Pablo I believe backreferences, which are a standard feature of regular expressions, are not permissible in regular languages but I'm not sure of the proof right now. I am aware that that regular expression won't be parsed by lots of parsers, anyway, just an example. The point is that the "regular" in "regular expressions" is actually meaningless. – Jacklynn Oct 19 '16 at 15:55
15

SQL is a type-2 grammar, it is too powerful to be described by regular expressions. It's the same as if you decided to generate C# code and then validate it without invoking a compiler. Database engine in general is too complex to be easily stubbed.

That said, you may try ANTLR's SQL grammars.

Constantin
  • 27,478
  • 10
  • 60
  • 79
  • 1
    Any context free grammar can be parsed by a modern regular expressions. Regular expressions no longer are restricted to parsing regular languages. If this were true, the language `{ a^nb^n }` would not be parseable, however this can be parsed by the regular expression `/^(a(\1)?b)$/`. See https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html for more info. – Jacklynn Jul 18 '16 at 22:44
  • 1
    @Jack, good point! So what would be the C# regex for ANSI SQL SELECT? – Constantin Aug 04 '16 at 20:21
2

As far as I know this is beyond regex and your getting close to the dark arts of BnF and compilers.

http://savage.net.au/SQL/

Same things happens to people who want to do correct syntax highlighting. You start cramming things into regex and then you end up writing a compiler...

jason saldo
  • 9,804
  • 5
  • 34
  • 41
2

I had the same problem - an approach that would work for all the more standard sql statements would be to spin up an in-memory Sqlite database and issue the query against it, if you get back a "table does not exist" error, then your query parsed properly.

George Mauer
  • 117,483
  • 131
  • 382
  • 612
1

Off the top of my head: Couldn't you pass the generated SQL to a database and use EXPLAIN on them and catch any exceptions which would indicate poorly formed SQL?

JeeBee
  • 17,476
  • 5
  • 50
  • 60
0

Have you tried the lazy selectors. Rather than match as much as possible, they match as little as possible which is probably what you need for quotes.

Orion Adrian
  • 19,053
  • 13
  • 51
  • 67
0

To validate the queries, just run them with SET NOEXEC ON, that is how Entreprise Manager does it when you parse a query without executing it.

Besides if you are using regex to validate sql queries, you can be almost certain that you will miss some corner cases, or that the query is not valid from other reasons, even if it's syntactically correct.

Pop Catalin
  • 61,751
  • 23
  • 87
  • 115
0

I suggest creating a database with the same schema, possibly using an embedded sql engine, and passing the sql to that.

Marcin
  • 48,559
  • 18
  • 128
  • 201
0

I don't think that you even need to have the schema created to be able to validate the statement, because the system will not try to resolve object_name etc until it has successfully parsed the statement.

With Oracle as an example, you would certainly get an error if you did:

select * from non_existant_table;

In this case, "ORA-00942: table or view does not exist".

However if you execute:

select * frm non_existant_table;

Then you'll get a syntax error, "ORA-00923: FROM keyword not found where expected".

It ought to be possible to classify errors into syntax parsing errors that indicate incorrect syntax and errors relating to tables name and permissions etc..

Add to that the problem of different RDBMSs and even different versions allowing different syntaxes and I think you really have to go to the db engine for this task.

David Aldridge
  • 51,479
  • 8
  • 68
  • 96
0

There are ANTLR grammars to parse SQL. It's really a better idea to use an in memory database or a very lightweight database such as sqlite. It seems wasteful to me to test whether the SQL is valid from a parsing standpoint, and much more useful to check the table and column names and the specifics of your query.

MattMcKnight
  • 8,185
  • 28
  • 35
0

The best way is to validate the parameters used to create the query, rather than the query itself. A function that receives the variables can check the length of the strings, valid numbers, valid emails or whatever. You can use regular expressions to do this validations.

-1

I am assuming you did something like .\* try instead [^"]* that will keep you from eating the whole line. It still will give false positives on cases where you have \ inside your strings.

Nimantha
  • 6,405
  • 6
  • 28
  • 69
J.J.
  • 4,856
  • 1
  • 24
  • 29
-1
public bool IsValid(string sql)
{
string pattern = @"SELECT\s.*FROM\s.*WHERE\s.*";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
return rgx.IsMatch(sql);
}