3

Question:

Can anybody give me a working regex expression (C#/VB.NET) that can remove single line comments from a SQL statement ?

I mean these comments:

-- This is a comment

not those

/* this is a comment */

because I already can handle the star comments.

I have a made a little parser that removes those comments when they are at the start of the line, but they can also be somewhere after code or worse, in a SQL-string 'hello --Test -- World' Those comments should also be removed (except those in a SQL string of course - if possible).

Surprisingly I didn't got the regex working. I would have assumed the star comments to be more difficult, but actually, they aren't.

As per request, here my code to remove /**/-style comments (In order to have it ignore SQL-Style strings, you have to subsitute strings with a uniqueidentifier (i used 4 concated), then apply the comment-removal, then apply string-backsubstitution.

    static string RemoveCstyleComments(string strInput) 
    { 
        string strPattern = @"/[*][\w\d\s]+[*]/"; 
        //strPattern = @"/\*.*?\*/"; // Doesn't work 
        //strPattern = "/\\*.*?\\*/"; // Doesn't work 
        //strPattern = @"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/ "; // Doesn't work 
        //strPattern = @"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/ "; // Doesn't work 

        // http://stackoverflow.com/questions/462843/improving-fixing-a-regex-for-c-style-block-comments 
        strPattern = @"/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/";  // Works ! 

        string strOutput = System.Text.RegularExpressions.Regex.Replace(strInput, strPattern, string.Empty, System.Text.RegularExpressions.RegexOptions.Multiline); 
        Console.WriteLine(strOutput); 
        return strOutput; 
    } // End Function RemoveCstyleComments 
Stefan Steiger
  • 78,642
  • 66
  • 377
  • 442

7 Answers7

8

I will disappoint all of you. This can't be done with regular expressions. Sure, it's easy to find comments not in a string (that even the OP could do), the real deal is comments in a string. There is a little hope of the look arounds, but that's still not enough. By telling that you have a preceding quote in a line won't guarantee anything. The only thing what guarantees you something is the oddity of quotes. Something you can't find with regular expression. So just simply go with non-regular-expression approach.

EDIT: Here's the c# code:

        String sql = "--this is a test\r\nselect stuff where substaff like '--this comment should stay' --this should be removed\r\n";
        char[] quotes = { '\'', '"'};
        int newCommentLiteral, lastCommentLiteral = 0;
        while ((newCommentLiteral = sql.IndexOf("--", lastCommentLiteral)) != -1)
        {
            int countQuotes = sql.Substring(lastCommentLiteral, newCommentLiteral - lastCommentLiteral).Split(quotes).Length - 1;
            if (countQuotes % 2 == 0) //this is a comment, since there's an even number of quotes preceding
            {
                int eol = sql.IndexOf("\r\n") + 2;
                if (eol == -1)
                    eol = sql.Length; //no more newline, meaning end of the string
                sql = sql.Remove(newCommentLiteral, eol - newCommentLiteral);
                lastCommentLiteral = newCommentLiteral;
            }
            else //this is within a string, find string ending and moving to it
            {
                int singleQuote = sql.IndexOf("'", newCommentLiteral);
                if (singleQuote == -1)
                    singleQuote = sql.Length;
                int doubleQuote = sql.IndexOf('"', newCommentLiteral);
                if (doubleQuote == -1)
                    doubleQuote = sql.Length;

                lastCommentLiteral = Math.Min(singleQuote, doubleQuote) + 1;

                //instead of finding the end of the string you could simply do += 2 but the program will become slightly slower
            }
        }

        Console.WriteLine(sql);

What this does: find every comment literal. For each, check if it's within a comment or not, by counting the number of quotes between the current match and the last one. If this number is even, then it's a comment, thus remove it (find first end of line and remove whats between). If it's odd, this is within a string, find the end of the string and move to it. Rgis snippet is based on a wierd SQL trick: 'this" is a valid string. Even tho the 2 quotes differ. If it's not true for your SQL language, you should try a completely different approach. I'll write a program to that too if that's the case, but this one's faster and more straightforward.

Máthé Endre-Botond
  • 4,826
  • 2
  • 29
  • 48
  • I guess it can be done using regular expressions. Please check my answer here: http://stackoverflow.com/a/33947706/3606250 – drizin Nov 26 '15 at 22:24
  • Just so anybody copy-pasting knows, this fails on: string sql = "SELECT 123 as abc, 'Hello foo /*bar*/ my --world ' AS xyz --Hello"; – Stefan Steiger Sep 10 '19 at 14:53
3

This seems to work well for me so far; it even ignores comments within strings, such as SELECT '--not a comment--' FROM ATable

    private static string removeComments(string sql)
    {
        string pattern = @"(?<=^ ([^'""] |['][^']*['] |[""][^""]*[""])*) (--.*$|/\*(.|\n)*?\*/)";
        return Regex.Replace(sql, pattern, "", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
    }

Note: it is designed to eliminate both /**/-style comments as well as -- style. Remove |/\*(.|\n)*?\*/ to get rid of the /**/ checking. Also be sure you are using the RegexOptions.IgnorePatternWhitespace Regex option!!

I wanted to be able to handle double-quotes too, but since T-SQL doesn't support them, you could get rid of |[""][^""]*[""] too.

Adapted from here.

Note (Mar 2015): In the end, I wound up using Antlr, a parser generator, for this project. There may have been some edge cases where the regex didn't work. In the end I was much more confident with the results having used Antlr, and it's worked well.

Community
  • 1
  • 1
transistor1
  • 2,915
  • 26
  • 42
3

You want something like this for the simple case

-{2,}.*

The -{2,} looks for a dash that happens 2 or more times

The .* gets the rest of the lines up to the newline

*But, for the edge cases, it appears that SinistraD is correct in that you cannot catch everything, however here is an article about how this can be done in C# with a combination of code and regex.

Justin Pihony
  • 66,056
  • 18
  • 147
  • 180
  • I don't think this will leave 'hello --Test -- World' alone. – Kramii Mar 23 '12 at 16:43
  • @Kramii This is fixed now. I had missed that. To whoever downvoted, please re-review if you are still looking :) – Justin Pihony Mar 23 '12 at 16:48
  • Close, but you've missed the case where '' appears in a string. For example, "WHERE name LIKE ' '' -- this is not a comment, but will also be matched'.". – Kramii Mar 23 '12 at 16:56
  • @Kramii This was due to the spacing. I believe this should be good now. It works against your example now at least – Justin Pihony Mar 23 '12 at 17:00
  • Sorry! WHERE name LIKE ' ''x-- this is not a comment, but will also be matched'. – Kramii Mar 23 '12 at 17:03
  • @JustinPihony you're really having to work hard for this one aren't you :) – Robbie Mar 23 '12 at 17:14
  • There are several things wrong: `[...]` is a [character class](http://www.regular-expressions.info/charclass.html), and does not contain any look-behind/ahead assertions. Even if the look arounds worked, you would not be able to make a distinction between a `'--'` inside a string literal and outside of it with a single regex pattern. You need a lexical analyzer for such a job (which consists of multiple regex patterns, not a single one!). – Bart Kiers Mar 23 '12 at 17:17
  • For the record: `[^(?!' ).*(?<!')]` matches a single character other than `(`, `?`, `!`, `'`, `)`, `.`, `*`, `<` or a space. – Bart Kiers Mar 23 '12 at 17:20
  • @BartKiers Agreed after some more research. I have updated the answer – Justin Pihony Mar 23 '12 at 17:22
  • @Kramii It appears after some research this is not really doable with just a single regex as others have said...I have updated my answer to relate this – Justin Pihony Mar 23 '12 at 17:23
1
Using System.Text.RegularExpressions;

public static string RemoveSQLCommentCallback(Match SQLLineMatch)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    bool open = false; //opening of SQL String found
    char prev_ch = ' ';

    foreach (char ch in SQLLineMatch.ToString())
    {
        if (ch == '\'')
        {
            open = !open;
        }
        else if ((!open && prev_ch == '-' && ch == '-'))
        {
            break;
        }
        sb.Append(ch);
        prev_ch = ch;
    }

    return sb.ToString().Trim('-');
}

The code

public static void Main()
{
    string sqlText = "WHERE DEPT_NAME LIKE '--Test--' AND START_DATE < SYSDATE -- Don't go over today";
    //for every matching line call callback func
    string result = Regex.Replace(sqlText, ".*--.*", RemoveSQLCommentCallback);
}

Let's replace, find all the lines that match dash dash comment and call your parsing function for every match.

Marko
  • 20,385
  • 13
  • 48
  • 64
Symon
  • 11
  • 2
1

As a late solution, the simplest way is to do it using ScriptDom-TSqlParser:

// https://michaeljswart.com/2014/04/removing-comments-from-sql/
// http://web.archive.org/web/*/https://michaeljswart.com/2014/04/removing-comments-from-sql/
public static string StripCommentsFromSQL(string SQL)
{
    Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser parser = 
        new Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser(true);

    System.Collections.Generic.IList<Microsoft.SqlServer.TransactSql.ScriptDom.ParseError> errors;


    Microsoft.SqlServer.TransactSql.ScriptDom.TSqlFragment fragments = 
        parser.Parse(new System.IO.StringReader(SQL), out errors);

    // clear comments
    string result = string.Join(
      string.Empty,
      fragments.ScriptTokenStream
          .Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.MultilineComment)
          .Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.SingleLineComment)
          .Select(x => x.Text));

    return result;

}

or instead of using the Microsoft-Parser, you can use ANTL4 TSqlLexer

or without any parser at all:

private static System.Text.RegularExpressions.Regex everythingExceptNewLines = 
    new System.Text.RegularExpressions.Regex("[^\r\n]");


// http://drizin.io/Removing-comments-from-SQL-scripts/
// http://web.archive.org/web/*/http://drizin.io/Removing-comments-from-SQL-scripts/
public static string RemoveComments(string input, bool preservePositions, bool removeLiterals = false)
{
    //based on http://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689
    var lineComments = @"--(.*?)\r?\n";
    var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
                                              // literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
                                              // there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
    var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
    var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
    var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
                                                          //var blockComments = @"/\*(.*?)\*/";  //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx

    //so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
    var nestedBlockComments = @"/\*
                         (?>
                         /\*  (?<LEVEL>)      # On opening push level
                         | 
                         \*/ (?<-LEVEL>)     # On closing pop level
                         |
                         (?! /\* | \*/ ) . # Match any char unless the opening and closing strings   
                         )+                         # /* or */ in the lookahead string
                         (?(LEVEL)(?!))             # If level exists then fail
                         \*/";

    string noComments = System.Text.RegularExpressions.Regex.Replace(input,
        nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
        me => {
            if (me.Value.StartsWith("/*") && preservePositions)
                return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
     else if (me.Value.StartsWith("/*") && !preservePositions)
                return "";
            else if (me.Value.StartsWith("--") && preservePositions)
                return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
     else if (me.Value.StartsWith("--") && !preservePositions)
                return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
     else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
                return me.Value; // do not remove object identifiers ever
     else if (!removeLiterals) // Keep the literal strings
         return me.Value;
            else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
     {
                var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
                return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
            }
            else if (removeLiterals && !preservePositions) // wrap completely all literals
         return "''";
            else
                throw new System.NotImplementedException();
        },
        System.Text.RegularExpressions.RegexOptions.Singleline | System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace);
    return noComments;
}
Stefan Steiger
  • 78,642
  • 66
  • 377
  • 442
0

In PHP, i'm using this code to uncomment SQL (only single line):

$sqlComments = '@(([\'"`]).*?[^\\\]\2)|((?:\#|--).*?$)\s*|(?<=;)\s+@ms';
/* Commented version
$sqlComments = '@
    (([\'"`]).*?[^\\\]\2) # $1 : Skip single & double quoted + backticked expressions
    |((?:\#|--).*?$)      # $3 : Match single line comments
    \s*                   # Trim after comments
    |(?<=;)\s+            # Trim after semi-colon
    @msx';
*/
$uncommentedSQL = trim( preg_replace( $sqlComments, '$1', $sql ) );
preg_match_all( $sqlComments, $sql, $comments );
$extractedComments = array_filter( $comments[ 3 ] );
var_dump( $uncommentedSQL, $extractedComments );

To remove all comments see Regex to match MySQL comments

Community
  • 1
  • 1
Adrien Gibrat
  • 917
  • 10
  • 13
0

I don't know if C#/VB.net regex is special in some way but traditionally s/--.*// should work.

glindste
  • 1,427
  • 12
  • 14