I have a strange issue: I need to remove JSON text from a tilde-delimited file (the JSON breaks the import because each line of the JSON ends in CRLF). Example line:

Test Plan Work~Response Status: BadRequest Bad Request,Response Content: {
  "trace": "0HM5285F2",
  "errors": [
    {
      "code": "server_error",
      "message": "Couldn't access service ",
      "moreInfoUrl": null,
      "target": {
        "type": null,
        "name": null
      }
    }
  ]
},Request: https://www.test.com Headers: Accept: application/json
SubscriberId: 
~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM

Or records like this one, which contain no JSON but still follow the same pattern:

Test Plan Pay Work~Response Status: InternalServerError Internal Server Error,Response Content: Error,Request: https://api.test.com Headers: Accept: application/json
Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5c
SubscriberId: eb7aee
~9d05b16e-e57b-44be-b028-b6ddsdfsdf62a5~1/20/2021 7:07:53 PM

I need both kinds of record to end up in this format:

Test Plan Work~Response Status: BadRequest Bad Request~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM

The JSON (including the CRLFs at the end of each of its lines) is breaking the import of the data into PowerShell. Any help or insight would be appreciated!

1 Answer

PowerShell (or rather, .NET) has two peculiar features in its regex engine that might be perfect for this use case - balancing groups and conditionals!

Balancing groups are a complicated feature to explain fully, but they essentially allow us to "keep count" of occurrences of specific named subexpressions in a regex pattern. Applied, it looks like this:

PS ~> $string = 'Here is text { but wait { it has } nested { blocks }} here is more text'
PS ~> $string -replace '\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\}'
Here is text  here is more text

Let's break down the regex pattern:

\{                    # match literal '{'
(?>                   # begin atomic group* 
     \{(?<depth>)     #     match literal '{' and increment counter
  |  [^{}]+           #  OR match any sequence of characters that are NOT '{' or '}'
  |  \}(?<-depth>)    #  OR match literal '}' and decrement counter
)*                    # end atomic group, whole group should match 0 or more times
(?                    # begin conditional group*
    (depth)(?!)       # if the 'depth' counter > 0, then FAIL!
)                     # end conditional group
\}                    # match literal '}' (corresponding to the initial '{')

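*) The (?>...) atomic grouping prevents backtracking - a safeguard against accidentally counting anything more than once. The (?(depth)(?!)) conditional tests whether the depth group still holds a capture; if it does, the empty negative lookahead (?!) forces the match to fail.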
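
To see the counter at work, here is a small sketch that runs the same pattern against a balanced and an unbalanced input. The two sample strings are made up for illustration and are not from the question. With balanced braces the whole outer block should match; with a closing brace missing, the outer { cannot be closed, so the first match should be the innermost balanced pair instead.

```powershell
# Same pattern as above, stored in a variable for reuse
$pattern = '\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\}'

# Hypothetical sample strings (assumptions, not taken from the question)
$balanced   = 'a { b { c } d } e'
$unbalanced = 'a { b { c } d'

# [regex]::Match returns the first match; .Value holds the matched text
$m1 = [regex]::Match($balanced,   $pattern).Value   # expected: '{ b { c } d }'
$m2 = [regex]::Match($unbalanced, $pattern).Value   # expected: '{ c }'

$m1
$m2
```

In the unbalanced case the engine cannot finish a match starting at the first {, because either the counter is still above zero or there is no } left to consume, so it moves on and finds the inner pair.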

To also consume the remaining fields, which contain CRLF characters, we can prefix the pattern with (?s) - this makes the . "any" metacharacter match newlines too, so .*? runs lazily up to the position just before ~87c5...:

(?s),Response Content:\s*\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\}.*?(?=~)

Or we can, perhaps more accurately, describe the fields following the JSON as repeating pairs of , and "not ,":

,Response Content:\s*(?:\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\})?\s*(?:,[^,]+?)*(?=~)

Let's give it a try against your multi-line input string:

$string = @'
Test Plan Work~Response Status: BadRequest Bad Request,Response Content: {
  "trace": "0HM5285F2",
  "errors": [
    {
      "code": "server_error",
      "message": "Couldn't access service ",
      "moreInfoUrl": null,
      "target": {
        "type": null,
        "name": null
      }
    }
  ]
},Request: https://www.test.com Headers: Accept: application/json
SubscriberId: 
~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM
'@
$string -replace ',Response Content:\s*(?:\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\})?\s*(?:,[^,]+?)*(?=~)'

Output:

Test Plan Work~Response Status: BadRequest Bad Request~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM
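
Note that the pattern above expects Response Content: to be followed by optional whitespace, an optional JSON object, and then either a comma or the ~, so a record whose content is plain text (such as Error in the second sample from the question) appears not to match. As a simpler alternative sketch, and not the balancing-group approach, one could drop everything from ,Response Content: up to the next ~ without inspecting the content at all. This assumes no field ever contains a literal ~. The shortened JSON, the file path in the comment, and the column names are all placeholders.

```powershell
# Alternative sketch. Assumption: no field contains a literal '~'.
# (?s) lets . match newlines; .*? stops lazily just before the next '~'.
$pattern = '(?s),Response Content:.*?(?=~)'

# Both kinds of record from the question (JSON shortened for brevity)
$records = @'
Test Plan Work~Response Status: BadRequest Bad Request,Response Content: {
  "trace": "0HM5285F2",
  "errors": []
},Request: https://www.test.com Headers: Accept: application/json
SubscriberId: 
~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM
Test Plan Pay Work~Response Status: InternalServerError Internal Server Error,Response Content: Error,Request: https://api.test.com Headers: Accept: application/json
Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5c
SubscriberId: eb7aee
~9d05b16e-e57b-44be-b028-b6ddsdfsdf62a5~1/20/2021 7:07:53 PM
'@

# For a real file, read it as ONE string rather than line by line, e.g.:
#   $records = Get-Content -Raw -Path .\export.txt   # hypothetical path
$clean = $records -replace $pattern

# Each record is now a single line; the column names are made up
$rows = ($clean -split '\r?\n') |
    ConvertFrom-Csv -Delimiter '~' -Header Name, Status, Id, Timestamp

$rows | Format-Table
```

Reading the file with -Raw matters here, because the match has to span several physical lines of the same record.
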
Mathias R. Jessen
  • @JackBlack Make the whole JSON subexpression "optional" by wrapping it in a grouping construct and add the `?` quantifier at the end (I've updated the last example) – Mathias R. Jessen Mar 09 '21 at 20:01
  • @JackBlack add the sample [to your question](https://stackoverflow.com/posts/66552720/edit), comments are not really fit for exchanging code or data :-) – Mathias R. Jessen Mar 09 '21 at 20:21
  • Good point, sorry about that, added above. Very new to this :) – Jack Black Mar 09 '21 at 20:24
  • Added an example of what I was asking in the original question. The original regex works great on the JSON sections but misses the ones with HTML. I am still looking to remove everything between the same items as originally. This must be so close... Thanks for your assistance so far! – Jack Black Mar 09 '21 at 21:05
  • @MathiasRJessen Any other suggestions to include the additional scenario above? This regex stuff is currently out of my league but trying to work through it... really appreciate any input! – Jack Black Mar 10 '21 at 14:57