
I have a block of text from which I need to extract & replace certain occurrences of text. The pattern I'm looking for has 5 components in this sequence:

1) /*<<@*/
2) any characters & symbols except this symbol combo: /*
3) /*
4) any upper or lower case letter, number, space or underscore
5) */

For example, I am so far unable to devise a regex pattern that can extract the 3 occurrences of the pattern from this text:

DECLARE @myDate DATETIME = /*<<@*/ '2018-07-20 00:00:00' /*My Date>>*/
DECLARE @myString VARCHAR(MAX) = /*<<@*/ 'whatever?' /*My String>>*/ DECLARE @isTrue VARCHAR(MAX) = /*<<@*/ 1 /*My Bool>>*/

These are the 3 occurrences that should be found:

1) /*<<@*/ '2018-07-20 00:00:00' /*My Date>>*/
2) /*<<@*/ 'whatever?' /*My String>>*/
3) /*<<@*/ 1 /*My Bool>>*/

But I always get 2 occurrences -- the second line is considered a single match instead of 2 matches:

1) /*<<@*/ '2018-07-20 00:00:00' /*My Date>>*/
2) /*<<@*/ 'whatever?' /*My String>>*/ DECLARE @isTrue VARCHAR(MAX) = /*<<@*/ 1 /*My Bool>>*/

Here is an example regex pattern, one of many that I've tried:

(\/\*<<@\*\/){1}(.*){1}([a-z]|[A-Z]|[0-9]|_|\s)*(>>\*\/){1}

If I move the 3rd DECLARE onto its own line, it works (because the . symbol stops at line returns), but I need to be able to extract the occurrences separately when they are on the same line.

I've tested all my patterns against the text using regexr.com and regexstorm.net. My patterns break down on the second component: I can find no way to make the pattern accept any characters or symbols except /*, so the regex always grabs too much. I've tried negative lookaheads for /*. I've also tried explicitly specifying all valid characters, but I couldn't find a way to NOT match the /* combo.
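A minimal reproduction of the behaviour, sketched with Python's `re` module instead of the .NET engine behind regexstorm.net (an assumption made only for illustration; the pattern uses syntax that both engines share):

```python
import re

# Sample text from the question; the second line holds two occurrences.
text = (
    "DECLARE @myDate DATETIME = /*<<@*/ '2018-07-20 00:00:00' /*My Date>>*/\n"
    "DECLARE @myString VARCHAR(MAX) = /*<<@*/ 'whatever?' /*My String>>*/ "
    "DECLARE @isTrue VARCHAR(MAX) = /*<<@*/ 1 /*My Bool>>*/"
)

# The attempted pattern, unchanged. The greedy (.*) runs to the end of the
# line and only backtracks as far as the LAST ">>*/" on that line.
attempt = re.compile(r"(\/\*<<@\*\/){1}(.*){1}([a-z]|[A-Z]|[0-9]|_|\s)*(>>\*\/){1}")

matches = [m.group(0) for m in attempt.finditer(text)]
for found in matches:
    print(found)
```

The second printed match spans both occurrences on the second line, which is the over-matching described above.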

Any help will be appreciated!

JarNat
  • Regarding `any characters & symbols except this symbol combo: /*` and `any upper or lower case letter, number, space or underscore`: you might be over-protecting against one record overflowing into the next. Really, you just need to protect against the beginning of a new record, like this: `/\*<<@\*/((?:(?!/\*<<@\*/).)*?)/\*(?!<<@\*/)(.*?)>>\*/`. If you have fairly uniform data, I would simplify the whole thing: [/\*<<@\*/(.*?)/\*(.*?)>>\*/](https://regex101.com/r/I6YCK5/1) –  Jul 24 '18 at 00:13

2 Answers


This seems to work for me: (\/\*<<@\*\/)((?:[^\/]|\/(?!\*))+?)(\/\*)((?:[^*]|\*(?!\/))+?)(\*\/)

It produces 5 capture groups, as seen here: https://regex101.com/r/rd1Tl9/1

The key aspect is this pattern: ((?:[^\/]|\/(?!\*))+?) which says: find any character that's not a /, or find a / that does not have a * right afterwards.

This allows you to grab portions that don't match your delimiters.
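As a rough check, here is a sketch that runs the pattern with Python's `re` module (an assumption; the question targets other engines, but the pattern only uses lookaheads, lazy quantifiers and character classes). The multi-line value and its `My Text` label are hypothetical, added only to show that `[^\/]` also matches line breaks:

```python
import re

# The pattern from this answer, unchanged.
pattern = re.compile(
    r"(\/\*<<@\*\/)((?:[^\/]|\/(?!\*))+?)(\/\*)((?:[^*]|\*(?!\/))+?)(\*\/)"
)

# Sample text from the question.
text = (
    "DECLARE @myDate DATETIME = /*<<@*/ '2018-07-20 00:00:00' /*My Date>>*/\n"
    "DECLARE @myString VARCHAR(MAX) = /*<<@*/ 'whatever?' /*My String>>*/ "
    "DECLARE @isTrue VARCHAR(MAX) = /*<<@*/ 1 /*My Bool>>*/"
)

matches = [m.group(0) for m in pattern.finditer(text)]
first_groups = pattern.search(text).groups()  # the 5 capture groups

# Hypothetical value containing a line break; a negated character class
# matches newlines, so no single-line flag is needed.
multiline = "/*<<@*/ 'line one\nline two' /*My Text>>*/"
multiline_match = pattern.search(multiline)

for found in matches:
    print(found)
print(first_groups)
```

Note that group 4 captures the label together with the trailing `>>`, because the pattern only treats `*/` as the closing delimiter.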

AnilRedshift
  • The first answer from Wiktor works pretty well -- but I chose this one as the answer because it handles the case where my component 2 (the parameter value) has a line return in it. – JarNat Jul 24 '18 at 16:26

You may use the following regex:

/\*<<@\*/(?:(?!/\*)[\s\S])*?/\*+[^*]*\*+(?:[^/*][^*]*\*+)*/

See the regex demo. If you need to use the regex as a regex literal, remember to escape forward slashes:

/\/\*<<@\*\/(?:(?!\/\*)[\s\S])*?\/\*+[^*]*\*+(?:[^\/*][^*]*\*+)*\//

If you need to use it in C#, define it as:

var pattern = @"(?s)/\*<<@\*/(?:(?!/\*).)*?/\*+[^*]*\*+(?:[^/*][^*]*\*+)*/";

Details

  • /\*<<@\*/ - a literal /*<<@*/ substring
  • (?:(?!/\*)[\s\S])*? - any char, zero or more occurrences, as few as possible, that does not start a /* sequence
  • /\*+[^*]*\*+(?:[^/*][^*]*\*+)*/ - a C-style comment regex.
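To see the three parts working together, here is a minimal sketch in Python's `re` module (an assumption; the C# declaration above is the form given for .NET). It extracts the occurrences and then replaces each one. The `?` replacement text is hypothetical and stands in for whatever value you need to substitute:

```python
import re

# The first pattern from this answer; [\s\S] lets the value span line breaks.
pattern = re.compile(r"/\*<<@\*/(?:(?!/\*)[\s\S])*?/\*+[^*]*\*+(?:[^/*][^*]*\*+)*/")

# Sample text from the question.
text = (
    "DECLARE @myDate DATETIME = /*<<@*/ '2018-07-20 00:00:00' /*My Date>>*/\n"
    "DECLARE @myString VARCHAR(MAX) = /*<<@*/ 'whatever?' /*My String>>*/ "
    "DECLARE @isTrue VARCHAR(MAX) = /*<<@*/ 1 /*My Bool>>*/"
)

# No capturing groups, so findall returns the full matches.
occurrences = pattern.findall(text)

# Replace every occurrence with a hypothetical placeholder.
replaced = pattern.sub("?", text)

for found in occurrences:
    print(found)
print(replaced)
```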
Wiktor Stribiżew
  • ah this is a nicer way of grouping multi-character delimiters. I'll keep this in mind for the future. – AnilRedshift Jul 23 '18 at 22:56
  • This answer is a good one but I marked AnilRedshift's reply as the answer because it handles the case where my component 2 has a line return in it. – JarNat Jul 24 '18 at 16:27
  • @JarNat That is not a problem. Use the `RegexOptions.Singleline` in the .NET method or add `(?s)` at the pattern start. Or replace `.` with `[\s\S]`. – Wiktor Stribiżew Jul 24 '18 at 16:36
  • @JarNat Actually, the most efficient regex for this is `/\/\*<<@\*\/[^\/]*(?:\/(?!\*)[^\/]*)*\/\*+[^*]*\*+(?:[^\/*][^*]*\*+)*\//` – Wiktor Stribiżew Jul 24 '18 at 16:42
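A small sketch comparing the pattern from the answer with the unrolled variant from the last comment, run in Python's `re` module (an assumption; the regex-literal delimiters are dropped). It checks only that both find the same occurrences on the sample text and does not measure speed. The extra `'a/b'` value and its `My Path` label are hypothetical, added to exercise a lone `/` inside a value:

```python
import re

# Tempered version from the answer body.
tempered = re.compile(r"/\*<<@\*/(?:(?!/\*)[\s\S])*?/\*+[^*]*\*+(?:[^/*][^*]*\*+)*/")

# Unrolled version from the comment, without the surrounding / delimiters.
unrolled = re.compile(
    r"\/\*<<@\*\/[^\/]*(?:\/(?!\*)[^\/]*)*\/\*+[^*]*\*+(?:[^\/*][^*]*\*+)*\/"
)

# Sample text from the question plus one hypothetical extra line.
text = (
    "DECLARE @myDate DATETIME = /*<<@*/ '2018-07-20 00:00:00' /*My Date>>*/\n"
    "DECLARE @myString VARCHAR(MAX) = /*<<@*/ 'whatever?' /*My String>>*/ "
    "DECLARE @isTrue VARCHAR(MAX) = /*<<@*/ 1 /*My Bool>>*/\n"
    "DECLARE @myPath VARCHAR(MAX) = /*<<@*/ 'a/b' /*My Path>>*/"
)

tempered_matches = tempered.findall(text)
unrolled_matches = unrolled.findall(text)

for found in unrolled_matches:
    print(found)
```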