How can I replace certain characters (and not others) between delimiters and across newlines, using Regex in Powershell?

Question

Here is a sample:

: [
          {
            "yearGroupId": 13,
            "educationPhaseEnum": 2,
            "name": "Year Group 12",
            "label": "YG 12"
          },
          {
            "yearGroupId": 14,
            "educationPhaseEnum": 2,
            "name": "Year Group 13",
            "label": "YG 13"
          }
        ]

I want to remove the line breaks, and all quotes. I only want to do this between the strings ': [' and ' ]'. So the desired output would look like this:

[      {        yearGroupId: 13,        educationPhaseEnum: 2,        name: Year Group 12,        label: YG 12      },      {        yearGroupId: 14,        educationPhaseEnum: 2,        name: Year Group 13,        label: YG 13      }    ]

I've tried Powershell -NoProfile "(Get-Content -Raw .\allacts.txt) -replace '(?<=\u003a\u0020\u005b).*[\n\r\u0022].*(?=\u0020\u0020\u0020\u0020\u005d)', '' | Out-File -FilePath allacts.txt -Force -Encoding ASCII"

and about a hundred other things... but can't get my head around how it's meant to work. What do I have to do to get Powershell to replace these characters within these bounds? In other places in the file I need the line breaks.

Thanks.

Edit: Yep, this is JSON data. The issue is that there are duplicate keys (I can't change that). Converting it to a CSV results Powershell ignoring duplicate keys and picking one of them to go into the output CSV. Directly importing the JSON into Excel (where I need it to go) results in Excel rejecting it as it can't handle duplicate keys.

So, I decided to just glom everything into one value and use Power Query to sort it out at the other end (using the commas as delimiters).

That looks like JSON data, so what are you going to do with it? — vonPryz, Feb 16 '22 at 17:27
You can certainly use a single regex for this because the left-hand and right-hand delimiters are different strings. — Wiktor Stribiżew, Feb 16 '22 at 23:08

mklement0 · Accepted Answer · 2022-02-17T08:36:23.120

Your task requires a dynamic -replace operation operation, which Windows PowerShell (powershell.exe) - unlike PowerShell (Core) 7+ (pwsh) - cannot provide directly:

You need to identify the block of interest in your input file...
...and then perform the desired transformations on that block only.

^{Update: As Wiktor's answer shows, non-dynamic single--replace operation solutions using look-around assertions - as you attempted - are possible - but they are somewhat mind-bending.}

This answer discusses dynamic replacements in more detail, but applied to your case this means (the assumption is that you're calling from outside PowerShell, such as from cmd.exe / a batch file):

powershell.exe -NoProfile -c "[regex]::Replace((Get-Content -Raw .\allacts.txt), '(?s): \[.+?\r?\n        \]', { param($match) $match.Value -replace '[\r\n\"]' }) | Out-File -FilePath allacts.txt -Force -Encoding ASCII"

For an explanation of the block-matching regex passed as the 2nd argument to [regex]::Replace() and the ability to experiment with it, see this regex101.com page.
The regex used inside the 3rd argument, the dynamic replacement script block ({ ... }), is [\r\n"], which matches all CR (\r), LF (\n), and " characters, and since it is used with -replace without a replacement operand, effectively removes them.

Thanks. I didn't realise it needed a completely different approach. Good to know I wasn't going to get anywhere trying to rearrange this or that character, good job I asked the question! — Ne Mo, Feb 16 '22 at 19:36
@NeMo, well, as Wiktor's answer shows, a solution with look-around assertions alone _is_ possible - but it is somewhat mind-bending. — mklement0, Feb 17 '22 at 08:16

Wiktor Stribiżew · Answer 2 · 2022-02-17T10:08:55.813

You can use either of the two plain string pattern regex replacements:

(Get-Content -Raw .\allacts.txt) -replace '(?s)(?<=: \[.*?)[\r\n"](?=.*? ])' | Out-File -FilePath allacts.txt -Force -Encoding ASCII

See this regex demo. Details:

(?s) - RegexOptions.Singleline enables . to match any chars including newline chars
(?<=: \[.*?) - a positive lookbehind that matches a location that is immediately preceded with : [ string and then any zero or more chars as few as possible
[\r\n"] - CR, LF or a " char
(?=.*? ]) - a positive lookahead that makes sure there are any zero or more chars as few as possible followed with a space + ] char immediately to the right of the current location.

Or, if you have : [.."...".: [ ... ] like strings, and you want to remove the chars only in between the closest : [ and ] you will need to use

(Get-Content -Raw .\allacts.txt) -replace '(?s)(?<=: \[(?:(?!: \[).)*?)[\r\n"](?=.*? ])' | Out-File -FilePath allacts.txt -Force -Encoding ASCII

See this regex demo (see Context tab). Details:

(?s) - RegexOptions.Singleline enables . to match any chars including newline chars
(?<=: \[(?:(?!: \[).)*?) - a positive lookbehind that matches a location that is immediately preceded with
- : \[ - : [ string
- (?:(?!: \[).)*? - any char, zero or more but as few as possible times, that does not start a : [ char sequence
[\r\n"] - CR, LF or a " char
(?=.*? ]) - a positive lookahead that makes sure there are any zero or more chars as few as possible followed with a space + ] char immediately to the right of the current location.

Matches are removed here.

Or,

(Get-Content -Raw .\allacts.txt) -replace '(?s)(\G(?!^)|: \[)(.*?)[\r\n"](?=.*? ])', '$1$2' | Out-File -FilePath allacts.txt -Force -Encoding ASCII

or

(Get-Content -Raw .\allacts.txt) -replace '(?s)(\G(?!^)|: \[)((?:(?!: \[).)*?)[\r\n"](?=.*? ])', '$1$2' | Out-File -FilePath allacts.txt -Force -Encoding ASCII

See this regex demo (do not forget to click Context tab there). Here

(?s) - . matches any chars now
(\G(?!^)|: \[) - Group 1 ($1): end of the previous match or : [ string
((?:(?!: \[).)*?) - Group 2 ($2): any char, zero or more but as few as possible times, that does not start a : [ char sequence
[\r\n"] - CR, LF or "
(?=.*? ]) - a check that there is space + ] somewhere on the right.

In this case, matches are replaced with Group 1 + Group 2 values.

Replace literal spaces with \s* (zero or more whitespaces) or \s+ (one or more whitespaces) in the pattern if you mean to match any (amount of) whitespaces.

Thanks for the solution. The other one is more readable but there may be situations in which it can't be used for some reason (I don't know) — Ne Mo, Feb 17 '22 at 10:04
@NeMo I might overcomplicated the pattern, what about just `'(?s)(?<=: \[.*?)[\r\n"](?=.*? ])'`? I added this one to the answer and explained the solution difference. — Wiktor Stribiżew, Feb 17 '22 at 10:06

How can I replace certain characters (and not others) between delimiters and across newlines, using Regex in Powershell?

2 Answers2