I have a string that is formed from tag-substitution, which also results in parts of the string being marked for deletion, for example:

Keep1
{/*DELETE}
Delete1a
    {/*DELETE}
    Delete2
    {DELETE*/}
Delete1b
{DELETE*/}
Keep2
{/*DELETE}
Delete3
{DELETE*/}
Keep3

Am I correct that a RegEx cannot be used to select only the inner DELETE2 and DELETE3, remove those, and then repeat to get the DELETE1a/b until no further matches are found?

The RegEx I am passing to my replace function is

\{\/\*DELETE\}([\s\S]*?)\{DELETE\*\/\}

This matches

{/*DELETE}
Delete1a
    {/*DELETE}
    Delete2
    {DELETE*/}

If this is the only RegEx match that I can make, I could [suppress the leading {/*DELETE} and] call the replace function recursively, which I think would enable me to remove the nested {TAGS}.

Is there a better way?

I am using the RegEx in VBScript

EDIT: In case it helps I can change the {/*DELETE} and {DELETE*/} tags, even to a single character

EDIT2: I could use a single-character as the Start/End delete marker - if, for example, that would be faster for a RegEx expression to resolve e.g. by being less complex

e.g. if the Start-Delete is [ and then end delete is ]

Keep1
[
Delete1a
    [
    Delete2
    ]
Delete1b
]
Keep2
[
Delete3
]
Keep3

These characters were chosen for readability in this example; in practice they would occur within my real-world data, but I expect I could choose two ASCII values which do not appear in my data at all.

Clarification: The {DELETE} tags will not always appear on a line by themselves, so this style of string formation will also exist

Keep1{/*DELETE}Delete1a
    {/*DELETE}Delete2{DELETE*/}
Delete1b{DELETE*/}Keep2a
Keep2b{/*DELETE}Delete3{DELETE*/}Keep3

or with single-character delete-tags:

Keep1[Delete1a
    [Delete2]
Delete1b]Keep2a
Keep2b[Delete3]Keep3
Kristen
  • Just please clarify: do you want to get `Keep1 Keep2 Keep3` in the end as three lines? What do you mean you can change the tags to single chars? It is not a good idea to use the same char for beginning/ending delimiters of a block, it is best to have a pair of different chars for the delimitation. – Wiktor Stribiżew Nov 02 '17 at 13:12
  • *Am I correct that a RegEx cannot be used to select only the inner DELETE2 and DELETE3, remove those, and then repeat to get the DELETE1a/b until no further matches are found?* That depends. There are edge cases where you can make it work with regular expressions, but not in general. – Ansgar Wiechers Nov 02 '17 at 13:16
  • I know you can use this regex in other languages: `({/\*DELETE}(?:.*?(?1).*?|.*?){DELETE\*/})`, I'm unable to test for vbscript at the moment though. – ctwheels Nov 02 '17 at 13:37
  • @Wiktor Stribiżew - I was meaning to use different single characters for Start/End e.g. if excluding a single-character (in RegEx) was easier than trying to exclude a whole {TAG} – Kristen Nov 02 '17 at 14:30
  • @Kristen Please add an example. – Wiktor Stribiżew Nov 02 '17 at 14:33
  • @Wiktor Stribiżew - clarification added. Do you think a single-character would allow a faster (i.e. less complex) RegEx expression? Good performance is a requirement for this code. – Kristen Nov 02 '17 at 15:12
  • Of course it will be super quick - `\[[^\][]+\]`. But are you sure you can rely on the `[` and `]` to be absent from the data you need? – Wiktor Stribiżew Nov 02 '17 at 15:16
  • Thanks. Yes, I can assign the Open/Close single-character tag based on characters that are very unlikely to be present in the string, but also test for existence and if found then empirically find non-existing characters. – Kristen Nov 02 '17 at 15:21

1 Answer

Multicharacter delimiters

If your delimiters are multicharacter tags, you may use a tempered greedy token:

\{\/\*DELETE}((?:(?!\{\/\*DELETE})[\s\S])*?)\{DELETE\*\/}
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The marked part matches any char, 0 or more times, that does not start a {/*DELETE} char sequence, so only the innermost blocks can match. Run this regex replace repeatedly until no match remains; see the Iteration 1 and Iteration 2 demos.
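The repeated replace can be sketched as a loop. This is a minimal illustration in Python rather than VBScript, so it is an assumption about how you would wire it up; the names `INNERMOST` and `strip_deleted` are made up for the example. In VBScript the same idea would presumably be a loop around `RegExp.Test` and `RegExp.Replace` with `Global = True`.

```python
import re

# Tempered greedy token: the body may not contain another opening tag,
# so only the innermost {/*DELETE}...{DELETE*/} blocks can match.
INNERMOST = re.compile(
    r"\{/\*DELETE\}((?:(?!\{/\*DELETE\})[\s\S])*?)\{DELETE\*/\}"
)


def strip_deleted(text: str) -> str:
    """Remove innermost blocks repeatedly until a pass removes nothing."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text


# Sample taken from the question (tags not on their own lines).
sample = (
    "Keep1{/*DELETE}Delete1a\n"
    "    {/*DELETE}Delete2{DELETE*/}\n"
    "Delete1b{DELETE*/}Keep2a\n"
    "Keep2b{/*DELETE}Delete3{DELETE*/}Keep3"
)
print(strip_deleted(sample))
```

The first pass removes the Delete2 and Delete3 blocks, the second removes the outer Delete1a/Delete1b block, and the third finds nothing and ends the loop.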

NOTE that if you have these delimiters inside comments or string literals, this won't work correctly.

To make it safer, you may require that each delimiting tag appears alone on its own line:

^\s*\{\/\*DELETE}(\s*(?:\r?\n(?!\s*\{(?:\/\*DELETE|DELETE\*\/)}).*)*)\r?\n\s*\{DELETE\*\/}\s*$

See the Iteration 1 and Iteration 2 demos (here, you will need to set regExp.Multiline = True).
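For the line-based variant, here is a hedged Python sketch of the same loop, with `re.MULTILINE` standing in for `regExp.Multiline = True`. The helper name `strip_line_blocks` is hypothetical, and the sample is the first example from the question. Because the pattern does not consume the line breaks around a block, empty lines are left behind, so the sketch filters them out at the end.

```python
import re

# Each tag must sit alone on its line; body lines may not be tag lines,
# so again only the innermost blocks can match on a given pass.
LINE_BLOCK = re.compile(
    r"^\s*\{/\*DELETE\}"
    r"(\s*(?:\r?\n(?!\s*\{(?:/\*DELETE|DELETE\*/)\}).*)*)"
    r"\r?\n\s*\{DELETE\*/\}\s*$",
    re.MULTILINE,  # ^ and $ match at line boundaries
)


def strip_line_blocks(text: str) -> str:
    """Remove innermost line-delimited blocks until none remain."""
    while True:
        text, count = LINE_BLOCK.subn("", text)
        if count == 0:
            return text


sample = "\n".join([
    "Keep1",
    "{/*DELETE}",
    "Delete1a",
    "    {/*DELETE}",
    "    Delete2",
    "    {DELETE*/}",
    "Delete1b",
    "{DELETE*/}",
    "Keep2",
    "{/*DELETE}",
    "Delete3",
    "{DELETE*/}",
    "Keep3",
])

# Drop the empty lines that the removed blocks leave behind.
kept = [line for line in strip_line_blocks(sample).splitlines() if line.strip()]
print(kept)
```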

Single char delimiters

This is by far the easiest scenario: you match the starting delimiter char, then any 0+ chars other than the starting and ending delimiter chars using a negated character class, and then the ending delimiter char.

If the starting delimiter char is [ and the ending delimiter char is ], the regex is the well-known

\[[^\][]*\]

See the regex demo: Iteration 1 and Iteration 2.

Note that [ and ] usually are part of data you need, so perhaps you will want to use some fancier paired chars, like ⦅ (U+2985 LEFT WHITE PARENTHESIS) and ⦆ (U+2986 RIGHT WHITE PARENTHESIS):

\u2985[^\u2985\u2986]*\u2986

See another regex demo.
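A small Python sketch of the single-char approach, parameterised on the delimiter pair so that it covers both [ ] and ⦅ ⦆. The function name `strip_marked` and its signature are assumptions for illustration, not part of any library.

```python
import re


def strip_marked(text: str, open_ch: str, close_ch: str) -> str:
    """Remove nested open_ch...close_ch blocks, innermost first."""
    if len(open_ch) != 1 or len(close_ch) != 1 or open_ch == close_ch:
        raise ValueError("need two different single-character delimiters")
    # Negated character class: neither delimiter may appear in the body.
    body = "[^" + re.escape(open_ch) + re.escape(close_ch) + "]*"
    pattern = re.compile(re.escape(open_ch) + body + re.escape(close_ch))
    while True:
        text, count = pattern.subn("", text)
        if count == 0:
            return text


# Sample from the question, single-character tags.
sample = (
    "Keep1[Delete1a\n"
    "    [Delete2]\n"
    "Delete1b]Keep2a\n"
    "Keep2b[Delete3]Keep3"
)
print(strip_marked(sample, "[", "]"))

# Same data with the white parentheses as delimiters.
fancy = sample.replace("[", "\u2985").replace("]", "\u2986")
print(strip_marked(fancy, "\u2985", "\u2986"))
```

A negated character class needs no lookahead at each position, which is why this form is usually the cheapest of the three to run.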

Wiktor Stribiżew
  • @Kristen I think you should use `⦅` and `⦆` to allow usual brackets in the data. Or check the Unicode table for weirder stuff :) – Wiktor Stribiżew Nov 02 '17 at 15:25
  • "fullwidth left white parenthesis" (U+FF5F) - never even knew that that was a thing let alone its cousins such as "left white tortoise shell bracket"! Thanks for that, looks ideal. – Kristen Nov 02 '17 at 15:35