13

I need a Regular Expression to capture ALL comments in a block of T-SQL. The Expression will need to work with the .Net Regex Class.

Let's say I have the following T-SQL:

-- This is Comment 1
SELECT Foo FROM Bar
GO

-- This is
-- Comment 2
UPDATE Bar SET Foo == 'Foo'
GO

/* This is Comment 3 */
DELETE FROM Bar WHERE Foo = 'Foo'

/* This is a
multi-line comment */
DROP TABLE Bar

I need to capture all of the comments, including the multi-line ones, so that I can strip them out.

EDIT: It would serve the same purpose to have an expression that takes everything BUT the comments.

Sachin Joseph
  • 18,928
  • 4
  • 42
  • 62
bopapa_1979
  • 8,949
  • 10
  • 51
  • 76
  • This answer should help - http://stackoverflow.com/questions/653143/c-regex-to-get-the-comments-block-out-of-pl-sql-code – ipr101 Oct 07 '11 at 16:59
  • 2
    How would you handle `/*/* Nested Comments? */*/` – JNK Oct 07 '11 at 17:14
  • @ipr101 - I have spent a good while looking at the post you mentioned, but without any luck. I am not very good at regular expressions. I have not been able to modify the example so that it gets only the comments, and none of the valid SQL statements. – bopapa_1979 Oct 07 '11 at 17:17
  • @JNK - I'd like for the entire nested comment, including the outer markers. – bopapa_1979 Oct 07 '11 at 17:17
  • Regular expressions are not suited to this. [You need a parser](http://blogs.msdn.com/b/gertd/archive/2008/08/21/getting-to-the-crown-jewels.aspx) – Martin Smith Oct 07 '11 at 17:35
  • 1
    @MartinSmith - Thanks for the solid tip. I did actually find a parser that works very well. It is part of the SubText blog engine, and is its own project within the source code: Subtext.Scripting. Phil Haack wrote his own blog post about it and why he created it. Check it out: http://haacked.com/archive/2007/11/04/a-library-for-executing-sql-scripts-with-go-separators-and.aspx. Also, the source code: http://code.google.com/p/subtext/. Direct download for just the scripting component: http://code.haacked.com/util/Subtext.Scripting.zip. Thanks again for the help! – bopapa_1979 Oct 18 '11 at 16:53

10 Answers10

21

This should work:

(--.*)|(((/\*)+?[\w\W]+?(\*/)+))
TT.
  • 15,774
  • 6
  • 47
  • 88
Jeremy
  • 6,580
  • 2
  • 25
  • 33
  • 3
    No it doesn't. It does not support nested comments as the OP stated. – FailedDev Oct 07 '11 at 17:42
  • @FailedDev - That statement was in regards to a question. It would be a nice-to-have, not really a requierement. – bopapa_1979 Oct 07 '11 at 18:18
  • This doesn't support breaklines – jcvegan Dec 12 '14 at 23:32
  • See my answer below. It works with nested block comments. http://stackoverflow.com/a/33947706/3606250 – drizin Nov 26 '15 at 22:16
  • 4
    If you have strings that contain what would otherwise be a SQL comment, this will match those as well unfortunately, e.g.: `SELECT * FROM foo WHERE a = 'This has -- dashes'` – joebeeson Jun 15 '17 at 15:33
  • I realize this is an old thread, but for anyone who got here via a search engine, I wanted to point out that I needed to escape the forward slash / to make this work, so this worked for me:. `(--.*)|(((\/\*)+?[\w\W]+?(\*\/)+))` [Test it Here](https://regex101.com/r/ThECKu/1) – rbsdca Feb 01 '18 at 09:32
9

In PHP, i'm using this code to uncomment SQL (this is the commented version -> x modifier) :

trim( preg_replace( '@
(([\'"]).*?[^\\\]\2) # $1 : Skip single & double quoted expressions
|(                   # $3 : Match comments
    (?:\#|--).*?$    # - Single line comment
    |                # - Multi line (nested) comments
     /\*             #   . comment open marker
        (?: [^/*]    #   . non comment-marker characters
            |/(?!\*) #   . not a comment open
            |\*(?!/) #   . not a comment close
            |(?R)    #   . recursive case
        )*           #   . repeat eventually
    \*\/             #   . comment close marker
)\s*                 # Trim after comments
|(?<=;)\s+           # Trim after semi-colon
@msx', '$1', $sql ) );

Short version:

trim( preg_replace( '@(([\'"]).*?[^\\\]\2)|((?:\#|--).*?$|/\*(?:[^/*]|/(?!\*)|\*(?!/)|(?R))*\*\/)\s*|(?<=;)\s+@ms', '$1', $sql ) );
Adrien Gibrat
  • 917
  • 10
  • 13
6

Using this code :

StringCollection resultList = new StringCollection(); 
try {
Regex regexObj = new Regex(@"/\*(?>(?:(?!\*/|/\*).)*)(?>(?:/\*(?>(?:(?!\*/|/\*).)*)\*/(?>(?:(?!\*/|/\*).)*))*).*?\*/|--.*?\r?[\n]", RegexOptions.Singleline);
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    resultList.Add(matchResult.Value);
    matchResult = matchResult.NextMatch();
} 
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}

With the following input :

-- This is Comment 1
SELECT Foo FROM Bar
GO

-- This is
-- Comment 2
UPDATE Bar SET Foo == 'Foo'
GO

/* This is Comment 3 */
DELETE FROM Bar WHERE Foo = 'Foo'

/* This is a
multi-line comment */
DROP TABLE Bar

/* comment /* nesting */ of /* two */ levels supported */
foo...

Produces these matches :

-- This is Comment 1
-- This is
-- Comment 2
/* This is Comment 3 */
/* This is a
multi-line comment */
/* comment /* nesting */ of /* two */ levels supported */

Not that this will only match 2 levels of nested comments, although in my life I have never seen more than one level being used. Ever.

FailedDev
  • 26,680
  • 9
  • 53
  • 73
  • 1
    This regex doesn't match a `--` comment if it appears on the last-line of a string without a trailing `\n`. – Dai Apr 18 '21 at 21:40
5

I made this function that removes all SQL comments, using plain regular expressons. It removes both line comments (even when there is not a linebreak after) and block comments (even if there are nested block comments). This function can also replace literals (useful if you are searching for something inside SQL procedures but you want to ignore strings).

My code was based on this answer (which is about C# comments), so I had to change line comments from "//" to "--", but more importantly I had to rewrite the block comments regex (using balancing groups) because SQL allows nested block comments, while C# doesn't.

Also, I have this "preservePositions" argument, which instead of stripping out the comments it just fills comments with whitespace. That's useful if you want to preserve the original position of each SQL command, in case you need to manipulate the original script while preserving original comments.

Regex everythingExceptNewLines = new Regex("[^\r\n]");
public string RemoveComments(string input, bool preservePositions, bool removeLiterals=false)
{
    //based on https://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689

    var lineComments = @"--(.*?)\r?\n";
    var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
    // literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
    // there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
    var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
    var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
    var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
    //var blockComments = @"/\*(.*?)\*/";  //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx
    //so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
    var nestedBlockComments = @"/\*
                                (?>
                                /\*  (?<LEVEL>)      # On opening push level
                                | 
                                \*/ (?<-LEVEL>)     # On closing pop level
                                |
                                (?! /\* | \*/ ) . # Match any char unless the opening and closing strings   
                                )+                         # /* or */ in the lookahead string
                                (?(LEVEL)(?!))             # If level exists then fail
                                \*/";

    string noComments = Regex.Replace(input,
            nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
        me => {
            if (me.Value.StartsWith("/*") && preservePositions)
                return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
            else if (me.Value.StartsWith("/*") && !preservePositions)
                return "";
            else if (me.Value.StartsWith("--") && preservePositions)
                return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
            else if (me.Value.StartsWith("--") && !preservePositions)
                return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
            else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
                return me.Value; // do not remove object identifiers ever
            else if (!removeLiterals) // Keep the literal strings
                return me.Value;
            else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
            {
                var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
                return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
            }
            else if (removeLiterals && !preservePositions) // wrap completely all literals
                return "''";
            else
                throw new NotImplementedException();
        },
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
    return noComments;
}

Test 1 (first original, then removing comments, last removing comments/literals)

[select /* block comment */ top 1 'a' /* block comment /* nested block comment */*/ from  sys.tables --LineComment
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables --FinalLineComment]

[select                     top 1 'a'                                               from  sys.tables              
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables                   ]

[select                     top 1 ' '                                               from  sys.tables              
union
select top 1 '                                             ' from sys.tables                   ]

Test 2 (first original, then removing comments, last removing comments/literals)

Original:
[create table [/*] /* 
  -- huh? */
(
    "--
     --" integer identity, -- /*
    [*/] varchar(20) /* -- */
         default '*/ /* -- */' /* /* /* */ */ */
);
            go]


[create table [/*]    

(
    "--
     --" integer identity,      
    [*/] varchar(20)         
         default '*/ /* -- */'                  
);
            go]


[create table [/*]    

(
    "--
     --" integer identity,      
    [*/] varchar(20)         
         default '           '                  
);
            go]
Community
  • 1
  • 1
drizin
  • 1,737
  • 1
  • 18
  • 44
  • Taking this from your code: Regex rex = new Regex(@"/\*(?>/\*(?)|\*/(?<-LEVEL>)|(?! /\* | \*/ ).)+(?(LEVEL)(?!))\*/", RegexOptions.Singleline | RegexOptions.CultureInvariant); MatchCollection mcoll = rex.Matches(thestring, 0); Strips all nested block comments. Thank you. It's the only one I found that actually works as intended. – BeanFlicker Dec 06 '15 at 17:37
  • 1
    David, I'm glad it worked. But please note that using all 6 regexs at the same time is important, because a block comment could be located inside a string literal (e.g.: select '/* this is a literal */' from sys.tables), or even inside a line comment. So my suggestion is to change the lambda function to extract exactly what you want. – drizin Dec 06 '15 at 22:47
  • Doesn't seem to work with this string: /* Stuff with nested comments /* wooooo \*/ Wonder whut comes next /* woot 2*/ create proc in dis with encryption */ CREATE procedure dbo.Woiotninntu with encryption as begin select * from master end – BeanFlicker Jan 31 '16 at 22:19
  • This works: Regex rex = new Regex(@"/\*(?>/\*(?)|\*/(?<-LEVEL>)|(?! /\*|\*/).)*(?(LEVEL)(?!))\*/", RegexOptions.Singleline | RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace); – BeanFlicker Feb 01 '16 at 15:30
  • This is very good one but when you have a comment like this `/* exec Some_Proc '123456789c', 'A86A23CE-F068-450B-9356-A21197B7A715|CD7EF1CA-3247-492B-804B-658C2473867D|B118E687-6CBB-4438-A407-80DE93995885' */` it will remove everything but `|` – Gayan Jul 15 '16 at 07:25
  • David, I've just tested now and it's working for me. Are you using the complete code? – drizin Jul 17 '16 at 02:50
  • 1
    @GayanRanasinghe, you are right. There was an error in my everythingExceptNewLines regex, it was actually NOT replacing the pipe character. I've just fixed the regex, and it´s working now. Thanks for the alert. – drizin Jul 17 '16 at 02:52
  • @drizin can you consolidate the two regexes for single line comments by adding the `RegexOptions.Multiline` option as well and then only using the latter? – Arin Taylor Jul 30 '17 at 21:14
  • I don't get the *quotedidentifiers* part here. Using RegExBuddy with either PCRE or C# syntax, (\""((\""\"")|[^""])*\"") does not match CREATE TABLE "select" ("identity" INT IDENTITY NOT NULL, "order" INT NOT NULL); – Jan Doggen Dec 14 '20 at 09:40
  • @JanDoggen I'm not familiar with RegExBuddy but I assume it's just about quoting correctly (since my code above is C#). I've tried with Ultrapico Expresso the following regex ("(("")|[^"])*") and it worked fine - it could identify all your 3 quoted identifiers ("select","identity", "order"). But it doesn't match the whole DDL statement - maybe you're missing the purpose here? The purpose of the code above is to remove SQL comments, and in case it's important to identify boundaries of literals and identifiers so that their contents does not get removed incorrectly as comments. – drizin Dec 14 '20 at 14:41
  • Ah, of course. In any RegEx tool or in my Delphi implemention I don't need to write "". Then it works fine ;-) – Jan Doggen Dec 15 '20 at 08:05
1

I see you're using Microsoft's SQL Server (as opposed to Oracle or MySQL). If you relax the regex requirement, it's now possible (since 2012) to use Microsoft's own parser:

using Microsoft.SqlServer.Management.TransactSql.ScriptDom;

...

public string StripCommentsFromSQL( string SQL ) {

    TSql110Parser parser = new TSql110Parser( true );
    IList<ParseError> errors;
    var fragments = parser.Parse( new System.IO.StringReader( SQL ), out errors );

    // clear comments
    string result = string.Join ( 
      string.Empty,
      fragments.ScriptTokenStream
          .Where( x => x.TokenType != TSqlTokenType.MultilineComment )
          .Where( x => x.TokenType != TSqlTokenType.SingleLineComment )
          .Select( x => x.Text ) );

    return result;

}

See Removing Comments From SQL

Michael J Swart
  • 3,060
  • 3
  • 29
  • 46
  • 1
    I tried this method and it works pretty well, however the performance is suffering. I compared my regex ([see my answer below](http://stackoverflow.com/a/33947706/3606250)) to this Microsoft parser, and I achieved the same results (on more than 3 thousand scripts) but with a fraction of time. – drizin Nov 27 '15 at 14:31
1

The following works fine - pg-minify, and not only for PostgreSQL, but for MS-SQL also.

Presumably, if we remove comments, that means the script is no longer for reading, and minifying it at the same time is a good idea.

That library deletes all comments as part of the script minification.

vitaly-t
  • 24,279
  • 15
  • 116
  • 138
1

Following up from Jeremy's answer and inspired by Adrien Gibrat's answer.

This is my version that supports comment characters inside single-quoted strings.

.NET C# note you need to enable RegexOptions.IgnorePatternWhitespace , most other languages this is the x option

(?: (?:'[^']*?') | (?<singleline>--[^\n]*) | (?<multiline>(?:\/\*)+?[\w\W]+?(?:\*\/)+) )

Example https://regex101.com/r/GMUAnc/3

  • Just realised you could end up with a trailing \r at the end of single line comment captures against CrLf encoded strings, might be good to trim the capture before using it in this case. – Gaspard Leon Sep 09 '22 at 01:45
  • to filter on only the comments I'm using foreach (Match match in commentRegEx.Matches(stringToMatch).Where(m => m.Groups["singleline"].Success || m.Groups["multiline"].Success)) { // do something with match.Value } – Gaspard Leon Sep 09 '22 at 05:13
1

This works for me:

(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))

It matches all comments starting with -- or enclosed within */ .. */ blocks

animuson
  • 53,861
  • 28
  • 137
  • 147
Intrepid
  • 2,781
  • 2
  • 29
  • 54
  • 3
    As in [your question](http://stackoverflow.com/questions/8375646/using-regular-expressions-to-scan-t-sql-for-object-dependencies) regular expressions are **not** suitable for this. e.g. `SELECT '/* This is not a comment */' FROM (SELECT 1 AS C) [/* nor is this */]` is perfectly valid TSQL. – Martin Smith Dec 04 '11 at 15:29
0

I am using this java code to remove all sql comments from text. It supports comments like /* ... */ , --..., nested comments, ignores comments inside quoted strings

  public static String stripComments(String sqlCommand) {
    StringBuilder result = new StringBuilder();
    //group 1 must be quoted string
    Pattern pattern = Pattern.compile("('(''|[^'])*')|(/\\*(.|[\\r\\n])*?\\*/)|(--(.*|[\\r\\n]))");
    Matcher matcher = pattern.matcher(sqlCommand);
    int prevIndex = 0;
    while(matcher.find()) {
      // add previous portion of string that was not found by regexp - meaning this is not a quoted string and not a comment
      result.append(sqlCommand, prevIndex, matcher.start());
      prevIndex = matcher.end();
      // add the quoted string
      if (matcher.group(1) != null) {
        result.append(sqlCommand, matcher.start(), matcher.end());
      }
    }
    result.append(sqlCommand.substring(prevIndex));
    return result.toString();
  }
Michael
  • 2,835
  • 2
  • 9
  • 15
0

Another option that worked for me:

[\t\r\n]|(--[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)

or

[\t]|(--[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)

(taken from this article)

gce
  • 1,563
  • 15
  • 17