1

Here is a sample:

: [
          {
            "yearGroupId": 13,
            "educationPhaseEnum": 2,
            "name": "Year Group 12",
            "label": "YG 12"
          },
          {
            "yearGroupId": 14,
            "educationPhaseEnum": 2,
            "name": "Year Group 13",
            "label": "YG 13"
          }
        ]

I want to remove the line breaks, and all quotes. I only want to do this between the strings ': [' and ' ]'. So the desired output would look like this:

[      {        yearGroupId: 13,        educationPhaseEnum: 2,        name: Year Group 12,        label: YG 12      },      {        yearGroupId: 14,        educationPhaseEnum: 2,        name: Year Group 13,        label: YG 13      }    ]

I've tried Powershell -NoProfile "(Get-Content -Raw .\allacts.txt) -replace '(?<=\u003a\u0020\u005b).*[\n\r\u0022].*(?=\u0020\u0020\u0020\u0020\u005d)', '' | Out-File -FilePath allacts.txt -Force -Encoding ASCII"

and about a hundred other things... but can't get my head around how it's meant to work. What do I have to do to get Powershell to replace these characters within these bounds? In other places in the file I need the line breaks.

Thanks.

Edit: Yep, this is JSON data. The issue is that there are duplicate keys (I can't change that). Converting it to a CSV results Powershell ignoring duplicate keys and picking one of them to go into the output CSV. Directly importing the JSON into Excel (where I need it to go) results in Excel rejecting it as it can't handle duplicate keys.

So, I decided to just glom everything into one value and use Power Query to sort it out at the other end (using the commas as delimiters).

Ne Mo
  • 198
  • 4
  • 18

2 Answers2

3

Your task requires a dynamic -replace operation operation, which Windows PowerShell (powershell.exe) - unlike PowerShell (Core) 7+ (pwsh) - cannot provide directly:

  • You need to identify the block of interest in your input file...

  • ...and then perform the desired transformations on that block only.

Update: As Wiktor's answer shows, non-dynamic single--replace operation solutions using look-around assertions - as you attempted - are possible - but they are somewhat mind-bending.

This answer discusses dynamic replacements in more detail, but applied to your case this means (the assumption is that you're calling from outside PowerShell, such as from cmd.exe / a batch file):

powershell.exe -NoProfile -c "[regex]::Replace((Get-Content -Raw .\allacts.txt), '(?s): \[.+?\r?\n        \]', { param($match) $match.Value -replace '[\r\n\"]' }) | Out-File -FilePath allacts.txt -Force -Encoding ASCII"
  • For an explanation of the block-matching regex passed as the 2nd argument to [regex]::Replace() and the ability to experiment with it, see this regex101.com page.

  • The regex used inside the 3rd argument, the dynamic replacement script block ({ ... }), is [\r\n"], which matches all CR (\r), LF (\n), and " characters, and since it is used with -replace without a replacement operand, effectively removes them.

mklement0
  • 382,024
  • 64
  • 607
  • 775
  • 1
    Thanks. I didn't realise it needed a completely different approach. Good to know I wasn't going to get anywhere trying to rearrange this or that character, good job I asked the question! – Ne Mo Feb 16 '22 at 19:36
  • @NeMo, well, as Wiktor's answer shows, a solution with look-around assertions alone _is_ possible - but it is somewhat mind-bending. – mklement0 Feb 17 '22 at 08:16
  • 1
    Ha. Either way I wasn't going to work out out myself! – Ne Mo Feb 17 '22 at 10:03
1

You can use either of the two plain string pattern regex replacements:

(Get-Content -Raw .\allacts.txt) -replace '(?s)(?<=: \[.*?)[\r\n"](?=.*? ])' | Out-File -FilePath allacts.txt -Force -Encoding ASCII

See this regex demo. Details:

  • (?s) - RegexOptions.Singleline enables . to match any chars including newline chars
  • (?<=: \[.*?) - a positive lookbehind that matches a location that is immediately preceded with : [ string and then any zero or more chars as few as possible
  • [\r\n"] - CR, LF or a " char
  • (?=.*? ]) - a positive lookahead that makes sure there are any zero or more chars as few as possible followed with a space + ] char immediately to the right of the current location.

Or, if you have : [.."...".: [ ... ] like strings, and you want to remove the chars only in between the closest : [ and ] you will need to use

(Get-Content -Raw .\allacts.txt) -replace '(?s)(?<=: \[(?:(?!: \[).)*?)[\r\n"](?=.*? ])' | Out-File -FilePath allacts.txt -Force -Encoding ASCII

See this regex demo (see Context tab). Details:

  • (?s) - RegexOptions.Singleline enables . to match any chars including newline chars
  • (?<=: \[(?:(?!: \[).)*?) - a positive lookbehind that matches a location that is immediately preceded with
    • : \[ - : [ string
    • (?:(?!: \[).)*? - any char, zero or more but as few as possible times, that does not start a : [ char sequence
  • [\r\n"] - CR, LF or a " char
  • (?=.*? ]) - a positive lookahead that makes sure there are any zero or more chars as few as possible followed with a space + ] char immediately to the right of the current location.

Matches are removed here.

Or,

(Get-Content -Raw .\allacts.txt) -replace '(?s)(\G(?!^)|: \[)(.*?)[\r\n"](?=.*? ])', '$1$2' | Out-File -FilePath allacts.txt -Force -Encoding ASCII

or

(Get-Content -Raw .\allacts.txt) -replace '(?s)(\G(?!^)|: \[)((?:(?!: \[).)*?)[\r\n"](?=.*? ])', '$1$2' | Out-File -FilePath allacts.txt -Force -Encoding ASCII

See this regex demo (do not forget to click Context tab there). Here

  • (?s) - . matches any chars now
  • (\G(?!^)|: \[) - Group 1 ($1): end of the previous match or : [ string
  • ((?:(?!: \[).)*?) - Group 2 ($2): any char, zero or more but as few as possible times, that does not start a : [ char sequence
  • [\r\n"] - CR, LF or "
  • (?=.*? ]) - a check that there is space + ] somewhere on the right.

In this case, matches are replaced with Group 1 + Group 2 values.

Replace literal spaces with \s* (zero or more whitespaces) or \s+ (one or more whitespaces) in the pattern if you mean to match any (amount of) whitespaces.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thanks for the solution. The other one is more readable but there may be situations in which it can't be used for some reason (I don't know) – Ne Mo Feb 17 '22 at 10:04
  • @NeMo I might overcomplicated the pattern, what about just `'(?s)(?<=: \[.*?)[\r\n"](?=.*? ])'`? I added this one to the answer and explained the solution difference. – Wiktor Stribiżew Feb 17 '22 at 10:06