1

I need to split T-SQL script files along the batch delimiter boundary and execute each segment separately. Splitting is easy. GO statements need to be alone on a line and can only be preceded and followed by white space. (Exception: GO can also be followed by a number but not a variable. I am ignoring this case for now. I am also ignoring for now the case of an inline comment after GO)

However, GO statements can be found inside block comments (where they are obviously ignored). Splitting along that boundary will result in broken code. I want to check and reject the script file if it contains a GO statement inside a block comment.

So far I have built this regex:

(\/\*)(.?([^\*][^\/])*?)(^(\s*?)go(\s*?)$)(.?([^\/][^\*])*?)(?=(\*\/))

It almost works, but it still has issues.

Splitting and identifying non compliant files will be done in PowerShell.

Note: At this stage it has been established we will not YET pursue using a parser. Possible parser options are still being investigated

I tested in regex101.com using this by adding and removing letters:

/*
llaa

GO


*/

GO

/*
a
*/

GO
a
/*

GOaaa

*/
a
GO

/*

a

*/
mklement0
  • 382,024
  • 64
  • 607
  • 775
alexvoda
  • 33
  • 5
  • 1
    The issues seem to be related to the number of characters (odd or even) and are probably caused by the `([^\*][^\/])` group. I attempted to mitigate this using `.?` but it still does not work. – alexvoda Dec 05 '18 at 17:13

2 Answers2

1

Try the following regex, which should only yield $true if a comment-internal GO is found; note that it also detects GO followed by a (decimal) number correctly:

@'
/* a comment with GO, but not on its own line */

/* This GO should be found.
   GO 12  
*/

/* This one is outside a comment */
GO
'@ -match '(?sm)/[*](.(?![*]/))+?^\s*go(\s+\d+)?\s*$'

The above yields $true due to the presence of the comment-embedded GO 12.

  • (?sm) turns on inline options s (make . match \n too) and m (make ^ and $ match the start and end of lines too).

  • /[*] matches the opening of a block comment (* is a metacharacter that must be escaped (\*) in order to be interpreted literally or specified inside a character set ([...]), as here).

  • (.(?![*]/))+? matches a single character (.) not followed by literal */ (using (?!...), a negative lookahead), one or more times (+), but non-greedily (?).

    • This is the key to matching a GO line truly only inside a block comment.
  • ^\s*go matches the start of a line (^), followed by a possibly empty run of whitespace (\s*), followed by literal go (note that PowerShell's -match operator is case-insensitive).

  • (\s+\d+)? optionally (?) matches a nonempty run of whitespace (\s+) followed by one or more (+) digits (\d).

  • \s*$ matches a possibly empty run of whitespace through to the line's end.

  • Assuming that all block comments are well-formed, there's no need to match the remainder of the comment.


For going beyond just rejecting undesired input, TheMadTechnician suggests using -split, which can be used to effectively eliminate those block comments that have embedded GO lines from the input:

$sanitized = @'
/* a comment with GO, but not on its own line */
before
/* This GO should be found.
   GO 12  
*/
after
/* This one is outside a comment */
GO
...
/* Another comment with a GO.
  foo
GO
*/
last
'@ -split '(?sm)/[*](?:.(?![*]/))+?^\s*go(?:\s+\d+)?\s*$.+?[*]/' -join ''

The above stores the following in variable $sanitized - note that the block comments with embedded GO statements are gone:

/* a comment with GO, but not on its own line */
before

after
/* This one is outside a comment */
GO
...

last

If you then want to break the resulting script into the constituent batches by the remaining - uncommented, effective - GO statements:

$sanitized -split '(?m)^\s*go(?:\s+\d+)?\s*$'

As you point out, GO isn't actually a part of T-SQL:

 GO is not a Transact-SQL statement; it is a command recognized by the sqlcmd and osql utilities and SQL Server Management Studio Code editor


As for what you tried:

Your /\*(.?([^*][^/])*?)^\s*?go subexpression (simplified here) intended to match the start of a block comment up to an embedded GO is ineffective at ensuring that substring */ is not present; it produces both false positives and false negative.

Example of false positive (matches, but shouldn't):

/*a*/
go

Example of a false negative (doesn't match, but should):

/*a*
go

As you suspected in a comment, the problem is that [^*][^/] matches a pair of characters, so the matching behavior ultimately depends on whether the number of input characters is odd or even; using simplified examples:

# Even number of chars. -> $false, as intended
'*/' -match '^(.?([^*][^/])*?)$'

# Odd number of chars. -> $true(!)
'*/a' -match '^(.?([^*][^/])*?)$'

Only a negative lookahead assertion, as shown above, can reliably exclude a given (multi-character) string.

mklement0
  • 382,024
  • 64
  • 607
  • 775
  • Glad to hear it, @alexvoda - it took a while to get there, but it certainly was a learning experience for me too. – mklement0 Dec 05 '18 at 23:33
0

My suggestion would be to remove the comments then split the code.

Sean Pearce
  • 1,150
  • 5
  • 10
  • The question is an ask to validate code; however [mklement0 's answer](https://stackoverflow.com/a/53638389/4496560) did both validation and specifically remove the comments with go in them before splitting. – Gregor y Dec 07 '18 at 16:57