Paragraph conditional nested regular expression (recursion)

Question

I need a regular expression that matches a paragraph from '&Start a' (the first one in the sample text) through its corresponding '&end a' (the last end in the sample text). The problem is that '&end a' is not always written out explicitly; sometimes it is just '&end'. It gets worse when there are nested '&Start b' and '&end b' blocks, whose end may also be written as a plain '&end', hence the confusion.

A sample block of target text for this regex is (sorry for putting it as a code block):

junk text

&Start a <

fulfilling text

fulfilling text

&Start b

&Start c

&end c

fulfilling text

&end

&end <

junk text

So the regex should match the whole paragraph that starts and ends with the lines marked with the < symbol (the symbol is only a marker here and is not part of the original text). That is, it should start at the '&Start X' we want and skip any nested '&Start Y' ... '&end' (or '&end Y') groups until it reaches the '&end' (or '&end X') that closes it.

This is not simple to implement. The expression I am working with is the following:

&start a([^&]*)(&end a|&end)

This matches isolated '&start a' ... '&end' paragraphs well, but when other '&start Y' lines come in between, it gets confused. I might use some if statement that skips the undesired blocks... Here is a more complicated version of the case:

junk text

&Start a <

fulfilling text

fulfilling text

&Start b

&Start c

&end

fulfilling text

&end

&end <

junk text

Here no '&end' is named explicitly. Note 1: '&start X' is always named, but '&end X' may also be written as a plain '&end'; either way it always closes the nearest preceding start that is still open. Note 2: I can't change the structure of my regex much because of stack overflow errors; I would rather adapt it to this specific case.

Sorry for the weird explanation, but I hope somebody can offer some viable advice.

Thank you

Edit:

#@ -split "`n" | ForEach-Object { $_.trim() } |

$files = Get-ChildItem "$PSScriptRoot" # root path

for($i=0; $i -lt $files.Count; $i++){

    #iterate through files from the current folder.
    $data = Get-Content -Path $files[$i].FullName

    # parse DisabledFeatures.txt file as array of strings (1 string per line of the file)
    $feature = Get-Content DisabledFeatures.txt

    #iterate for each string entry in $feature array (read from txt file)
    for($counter=0; $counter -lt $feature.Count; $counter++){

        #retrieve the current array value to use it in the main algorithm
        $groupID = $feature[$counter]

        $data | ForEach-Object -Begin { $ignore = $false; $levels = 0 } -Process {
            #Start ignoring text after we've found the trigger
            if($_ -match "^`#ifdef $groupID") { $ignore = $true }   
            #Track nested groups
            elseif($ignore) {
                if ($_ -match '^#ifdef') { $levels++ }
                elseif ($_ -match '#endif') {
                    if($levels -ge 1) { $levels-- }
                    #If no nesting, we've hit the end of our targeted group. Stop ignoring
                    else { $ignore = $false }
                }
            }
            #Write line
            else { $_ }
        }  
    }
}

You can't use regex for this, you need a parser. [Obligatory link](http://stackoverflow.com/a/1732454/52598) — Lieven Keersmaekers, Feb 13 '17 at 10:18
It is not quite clear, see https://regex101.com/r/GJ7dG6/1 - do you need something like this? And is it in Java or Powershell? — Wiktor Stribiżew, Feb 13 '17 at 10:30
Java. No, sorry, the '<' symbol is not part of the string. It is only there to show in my explanation where the match should start and end. — Jackson, Feb 13 '17 at 10:32
You want it to end at the matching `&end` "tag" when there can be multiple start/end tags in between. Like I said, that's not possible with regex *(and keep your sanity)* — Lieven Keersmaekers, Feb 13 '17 at 10:36
@WiktorStribiżew - That would work for `&Start a` but not for `&Start b` — Lieven Keersmaekers, Feb 13 '17 at 10:37
Ok, there is no way to write a Java regex for that kind of text (as Java regex does not support recursion). You really need to write some parsing code for that. — Wiktor Stribiżew, Feb 13 '17 at 10:39
@LievenKeersmaekers exactly what I want. The thing is that If I am looking for '&Start a', i want to delete until the 'end' which corresponds to 'start a' (because they are in order), and not caring about anything in between. — Jackson, Feb 13 '17 at 10:45
@WiktorStribiżew Any suggestion about this parsing code stuff ? — Jackson, Feb 13 '17 at 10:49
If you know the max nesting depth you can build the regex manually which of course looks ugly. For your sample max of two levels such as [`(?is)&start(?>(?:(?!&start|&end).)+|&start(?:(?:(?!&start|&end).)+|&start(?:(?:(?!&start|&end).)+)&end)*&end)*&end`](http://fiddle.re/73kzdn) deeper nesting needs to be added to pattern. — bobble bubble, Feb 13 '17 at 10:58
@bobblebubble your example for some reason ends up in catastrophic backtracking :) — Jackson, Feb 13 '17 at 11:00
@Jackson Well don't know your input. I [tested it also here](https://regex101.com/r/ZcuN6P/1) but that is a pcre tester. — bobble bubble, Feb 13 '17 at 11:02
@bobblebubble So this will only work with 2 additional 'start'-'end' groups in between , and it is impossible to extend it to more groups ? — Jackson, Feb 13 '17 at 11:05
@Jackson Up to two levels of nesting (start a, start b, start c). Yes, deeper nesting needs to be added and the pattern looks a bit more catastrophic with each added level (: but if you mean it should match such as start a, start b, start b, start c, start b, start c... until end a. — bobble bubble, Feb 13 '17 at 11:07
See how to [add max nesting depth here](https://regex101.com/r/ZcuN6P/2) (example with max 5 levels of nesting, remove newlines in pattern when done) but sounds like your input might be huge. — bobble bubble, Feb 13 '17 at 11:18
@bobblebubble in your last example, where are you indicating that the first 'start' must be 'start a' ? or is it just generic ? — Jackson, Feb 13 '17 at 11:26
@Jackson it's generic and will match the most outer level of up to 5 nested levels. — bobble bubble, Feb 13 '17 at 11:26
@bobblebubble - You are sending OP on a path to insanity. You have him believe that it's doable using regex and it's really not. I can't fathom OP knowing the nesting level beforehand and figuring it out takes as long as just copy paste the whole block at once. — Lieven Keersmaekers, Feb 13 '17 at 11:27
Jackson better listen to @LievenKeersmaekers who is surely right (: — bobble bubble, Feb 13 '17 at 11:28
@WiktorStribiżew - You meant using recursion as in [this question](http://stackoverflow.com/questions/26385984/recursive-pattern-in-regex). — Lieven Keersmaekers, Feb 13 '17 at 11:41
@LievenKeersmaekers: Yes, something like that. In Powershell, with .NET regex, it is also possible with the balanced constructs. — Wiktor Stribiżew, Feb 13 '17 at 11:44
@WiktorStribiżew - It's tagged `Powershell` so all .Net functionality is available. Feel free to provide an answer using those constructs *(and teach us a thing or two)* — Lieven Keersmaekers, Feb 13 '17 at 12:00
Once the contents are read into a variable, [`(?ism)^&Start a(?:(?!^&(?:end|start)\b).|(?<c>)^&start\b|(?<-c>)^&end\b)*(?(c)(?!))^&end`](http://regexstorm.net/tester?p=%5e%26Start+a%28%3f%3a%28%3f!%5e%26%28%3f%3aend%7cstart%29%5cb%29.%7c%28%3f%3cc%3e%29%5e%26start%5cb%7c%28%3f%3c-c%3e%29%5e%26end%5cb%29*%28%3f%28c%29%28%3f!%29%29%5e%26end&i=junk+text%0d%0a%0d%0a%26Start+a%0d%0a%0d%0afulfilling+text%0d%0a%0d%0afulfilling+text%0d%0a%0d%0a%26Start+b%0d%0a%0d%0a%26Start+c%0d%0a%0d%0a%26end+c%0d%0a%0d%0afulfilling+text%0d%0a%0d%0a%26end%0d%0a%0d%0a%26end%0d%0a%0d%0ajunk+text&o=ism) can be used. — Wiktor Stribiżew, Feb 13 '17 at 12:07
@WiktorStribiżew this gives a pattern error for some reason , but the main point here is that c, d, e, f, g, etc, are unknown strings :) — Jackson, Feb 13 '17 at 12:42
If you use it in Java, sure it will. This is a .NET regex. The pattern does not care if you have `c`, `d`, `e` or `g`, or `@$#^UKUMNFG`. — Wiktor Stribiżew, Feb 13 '17 at 12:45
What does exactly use it in java mean? I have a Java Script which calls the matcher with the regexr, but not sure about. — Jackson, Feb 13 '17 at 12:58

Frode F. · Answer 1 · 2017-02-16T09:17:03.177

1

A pure regex-solution is probably not the best solution for this problem. It can probably be done, but it would likely be very complex and unreadable. I would use a simple parser for this. Example:

function Remove-TextGroup {
    param(
        [Parameter(Mandatory=$true)]
        [string[]]$Data,
        [Parameter(Mandatory=$true)]
        [string]$GroupID
    )

    $Data | ForEach-Object -Begin { $ignore = $false; $levels = 0 } -Process {
        #Start ignoring text after we've found the trigger
        if($_ -match "^&start $GroupID") { $ignore = $true }   
        #Track nested groups
        elseif($ignore) {
            if ($_ -match '^&start') { $levels++ }
            elseif ($_ -match '^&end') {
                if($levels -ge 1) { $levels-- }
                #If no nesting, we've hit the end of our targeted group. Stop ignoring
                else { $ignore = $false }
            }
        }
        #Write line
        else { $_ }

    }
}

Usage:

$data = @"
junk text

&Start a <

fulfilling text

fulfilling text

&Start b

&Start c

&end

fulfilling text

&end

&end <

junk text
"@ -split "`n" | ForEach-Object { $_.trim() } |
#Remove empty lines
Where-Object { $_ }

Remove-TextGroup -Data $data -GroupID a    

#Or read the data from a file instead of the here-string above:
#$data = Get-Content -Path Myfile.txt
#Remove-TextGroup -Data $data -GroupID a

Output:

junk text
junk text

If the files are big, I would optimize the sample above to use a streamreader for reading the file.
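A minimal sketch of what that streaming variant could look like, assuming the same line-based `&start` / `&end` rules as above. The function name `Remove-TextGroupFromFile` and its `-Path` parameter are made up for this illustration. Unlike the usage sample, it does not drop empty lines.

```powershell
# Hypothetical streaming variant of Remove-TextGroup (name and -Path parameter are illustrative).
function Remove-TextGroupFromFile {
    param(
        [Parameter(Mandatory=$true)]
        [string]$Path,
        [Parameter(Mandatory=$true)]
        [string]$GroupID
    )

    $ignore = $false
    $levels = 0

    # .NET does not follow PowerShell's current location, so resolve to a full path first
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $reader = New-Object System.IO.StreamReader($fullPath)
    try {
        # ReadLine() returns $null at end of file, so only one line is held in memory at a time
        while ($null -ne ($line = $reader.ReadLine())) {
            $line = $line.Trim()
            if (-not $ignore) {
                #Start ignoring text after we've found the trigger
                if ($line -match "^&start $GroupID") { $ignore = $true }
                #Write line
                else { $line }
            }
            #Track nested groups
            elseif ($line -match '^&start') { $levels++ }
            elseif ($line -match '^&end') {
                if ($levels -ge 1) { $levels-- }
                #If no nesting, we've hit the end of our targeted group. Stop ignoring
                else { $ignore = $false }
            }
        }
    }
    finally {
        # Always release the file handle, even if something above throws
        $reader.Dispose()
    }
}
```

It would be called as `Remove-TextGroupFromFile -Path Myfile.txt -GroupID a`. The logic is the same level counter as in `Remove-TextGroup`; only the way the lines are read changes.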

edited Feb 16 '17 at 09:17

answered Feb 13 '17 at 14:41

Frode F.


Test inside a should be as well deleted. I will try to implement your solution. Is there any way to create a powershell script that can be run from a batch file? – Jackson Feb 14 '17 at 06:40
Why? It's only inside group a. Didn't you say that you wanted to skip b and c groups? If we ignore "test inside a" then we could just as well have stopped on start b. – Frode F. Feb 14 '17 at 06:48
Everything until 'end a' must be deleted, even if it is in the a block. Sorry for my explanation.. – Jackson Feb 14 '17 at 07:02
I thought you wanted to keep group a. Were you going to remove the contents (not start and end) for group a and keep the rest of the document? If so, it should be specified better in the question. A desired output sample would also help. – Frode F. Feb 14 '17 at 07:10
I want to delete group a entirely, including start a and its corresponding end. I will update question later. Sorry again. – Jackson Feb 14 '17 at 07:12
See updated reversed answer. Does that solve the problem? – Frode F. Feb 14 '17 at 08:18
I will test it and give the answer as valid if it works :) – Jackson Feb 14 '17 at 08:48
I don't see the part of the code that looks for '&Start a'... what is the variable $groupID supposed to do? Also, how can this script be modified so that it gets its input from a file? – Jackson Feb 16 '17 at 08:08
`$groupID` is the variable that defines the name of the Group you want, in this case it's `a`, but this is probably something you want to have as a parameter in a script/function. `if($_ -match "^&start $groupID") { $ignore = $true } ` is the line that uses it to find the Group. There is a comment how to read from file in the code. I don't know if you wanted to modify a file, read input from another cmdlet etc. so you will have to modify it to your needs. – Frode F. Feb 16 '17 at 09:09
I tried to make an adaptation from your code but I don't understand the first line of your script, is it the main deleter of the match? Also check my question again to see what I built. – Jackson Feb 16 '17 at 09:32
What first line? See updated answer where I made it into a cleaner function – Frode F. Feb 16 '17 at 10:01
