13

I have a file containing some properties, some of whose values contain escape characters, for example URLs and regex patterns.

When I read the content and convert it back to JSON, with or without unescaping, the result is not correct. If I convert back to JSON with unescaping, some regular expressions break; if I convert without unescaping, URLs and some regular expressions break.

How can I solve the problem?

Minimal Complete Verifiable Example

Here are some simple code blocks that let you reproduce the problem:

Content

$fileContent = 
@"
{
    "something":  "http://domain/?x=1&y=2",
    "pattern":  "^(?!(\\`|\\~|\\!|\\@|\\#|\\$|\\||\\\\|\\'|\\\")).*"
}
"@

With Unescape

If I read the content and then convert it back to JSON using the following command:

$fileContent | ConvertFrom-Json | ConvertTo-Json | %{[regex]::Unescape($_)}

The output (which is wrong) would be:

{
    "something":  "http://domain/?x=1&y=2",
    "pattern":  "^(?!(\|\~|\!|\@|\#|\$|\||\\|\'|\")).*"
}

Without Unescape

If I read the content and then convert it back to JSON using the following command:

$fileContent | ConvertFrom-Json | ConvertTo-Json 

The output (which is wrong) would be:

{
    "something":  "http://domain/?x=1\u0026y=2",
    "pattern":  "^(?!(\\|\\~|\\!|\\@|\\#|\\$|\\||\\\\|\\\u0027|\\\")).*"
}

Expected Result

The expected result should be the same as the input file content.

mklement0
  • 382,024
  • 64
  • 607
  • 775
Reza Aghaei
  • 120,393
  • 18
  • 203
  • 398

3 Answers

23

I decided not to use Unescape and instead replace the Unicode \uxxxx escape sequences with the characters they represent; now it works properly:

$fileContent = 
@"
{
    "something":  "http://domain/?x=1&y=2",
    "pattern":  "^(?!(\\`|\\~|\\!|\\@|\\#|\\$|\\||\\\\|\\'|\\\")).*"
}
"@

$fileContent | ConvertFrom-Json | ConvertTo-Json | %{
    [Regex]::Replace($_, 
        "\\u(?<Value>[a-zA-Z0-9]{4})", {
            param($m) ([char]([int]::Parse($m.Groups['Value'].Value,
                [System.Globalization.NumberStyles]::HexNumber))).ToString() } )}

Which generates the expected output:

{
    "something":  "http://domain/?x=1&y=2",
    "pattern":  "^(?!(\\|\\~|\\!|\\@|\\#|\\$|\\||\\\\|\\'|\\\")).*"
}
mklement0
Reza Aghaei
  • 4
    This was most helpful in solving a problem to edit an ARM (Azure Resource Manager) template. – Stringfellow Jan 11 '19 at 01:32
  • 2
    This is exactly what I was looking for in order to render Powershell's JSON output interoperable with my Python parsers...Seems that Microsoft really made it near impossible to use Powershell for JSON (by default ConvertTo-JSON writes an UTF-8 file with BOM, which is just unusable too in REST world)... Anyway, thanks a lot :) – Orsiris de Jong Apr 01 '19 at 11:14
  • 1
    @OrsirisdeJong Indeed, JSON escaping seems like a pain for PowerShell. – r3verse Sep 13 '19 at 10:45
  • I think the regex must be (?<=[^\\])\\u(?<Value>[a-zA-Z0-9]{4}) to avoid replacing \\u... Add the (?<=[^\\]) lookbehind. – Ilyan Apr 22 '20 at 11:55
  • 2
    @Ilyan your suggestion doesn't handle situations where \u is at the start of the string. This regex seems to handle it: (?<![\\])\\u(?<Value>[a-zA-Z0-9]{4}). Add (?<![\\]) – Maxim Ozerov Nov 17 '20 at 12:16
  • @MaximOzerov, ConvertTo-Json returns valid JSON, so it can't start with \u. But your regex also supports unescaping JSON substrings, thanks. – Ilyan Feb 01 '21 at 21:16
  • Have been using this code successfully for quite some time - thanks @reza-aghaei - but ran into some unexpected results today, when the function tried to parse a non-numeric value (\update). Changing [a-zA-Z0-9] to [a-fA-F0-9] fixes this case, as non-hex characters will not trigger a match. – Magnus Apr 11 '23 at 15:15
3

If you don't want to rely on Regex (from @Reza Aghaei's answer), you could import the Newtonsoft JSON library. The benefit is the default StringEscapeHandling property which escapes control characters only. Another benefit is avoiding the potentially dangerous string replacements you would be doing with Regex.

This StringEscapeHandling is also the default in PowerShell Core (version 6 and up), which has used Newtonsoft internally since then. So another alternative is to use ConvertFrom-Json and ConvertTo-Json from PowerShell Core.

Your code would look something like this if you import the Newtonsoft JSON library:

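# LoadFile() requires an absolute path, so resolve the relative one first.
[Reflection.Assembly]::LoadFile((Resolve-Path "Newtonsoft.Json.dll").ProviderPath)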

$json = Get-Content -Raw -Path file.json -Encoding UTF8 # read file
$unescaped = [Newtonsoft.Json.Linq.JObject]::Parse($json) # similar to ConvertFrom-Json

$escapedElementValue = [Newtonsoft.Json.JsonConvert]::ToString($unescaped.apiName.Value) # similar to ConvertTo-Json
$escapedCompleteJson = [Newtonsoft.Json.JsonConvert]::SerializeObject($unescaped) # similar to ConvertTo-Json

Write-Output "Variable passed = $escapedElementValue"
Write-Output "Same JSON as Input = $escapedCompleteJson"
r3verse
  • 1,000
  • 8
  • 19
  • Is that solution portable ? There is like one DLL per .Net version of NewtonSoft's JSON library. Depending on the target OS, one would have to bundle different versions of that same DLL, doesn't it ? – Orsiris de Jong Sep 14 '19 at 13:15
  • 1
    @OrsirisdeJong Most OS's support 4.5 and up, so take the lowest you need as .NET is backwards compatible. Newtonsoft goes as low as .NET 2.0! Unless you need to target Windows XP or lower systems, i wouldn't look any more further than 4.5. – r3verse Sep 14 '19 at 14:00
  • Thanks. Last question, my targets are NT6.1+, so I can go with .Net framework 3.5. I target 32 bit and 64 bit systems, but did only find one version of the DLL per .net Framework version, regardless of the bitness. Is it a 32 bit DLL that loads on 64 bit systems, or is there something I missed? – Orsiris de Jong Sep 14 '19 at 14:08
  • 1
    @OrsirisdeJong It's not really specified but i guess they target both 32-bit and 64-bit systems. (AnyCPU configuration; see: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-options/platform-compiler-option) – r3verse Sep 14 '19 at 14:23
  • @r3verse Thanks for sharing the idea. A few days ago (the day that I added a link in your post and +1) I gave it a try and the result was acceptable. However, both answers (mine and yours) have a small defect, they are ignoring the \` character which I haven't noticed before. Do you have any idea about it? – Reza Aghaei Oct 07 '19 at 09:42
  • @RezaAghaei Thanks for the comment, i tried to reproduce the error but i have no problem with the grave-accent(`) PowerShell escape character. I've also updated my answer to include 2 scenario's to output either an element value or the complete JSON. In both cases i get the grave-accent back in the Write-Output – r3verse Oct 07 '19 at 14:17
  • Thanks for checking it again. Could you please let me know the Newtonsoft.Json package version? Then I'll give it another try using the exact same input that I shared in the question. – Reza Aghaei Oct 07 '19 at 14:23
  • @RezaAghaei I'm using the latest stable release: 12.0.2 (from NuGet https://www.nuget.org/packages/Newtonsoft.Json/) – r3verse Oct 07 '19 at 14:30
2

Note:

  • Applying [regex]::Unescape() isn't called for, as JSON's escaping is unrelated to regex escaping.

  • That is, $fileContent | ConvertFrom-Json | ConvertTo-Json should work as-is, but doesn't due to a quirk in Windows PowerShell, which causes the & in your input string to be represented as its equivalent escape sequence on re-conversion, \u0026; the quirk similarly affects ' (\u0027), < (\u003c) and > (\u003e).


tl;dr

The problem does not affect PowerShell (Core) 6+ (the install-on-demand, cross-platform PowerShell edition), which uses a different implementation of the ConvertTo-Json and ConvertFrom-Json cmdlets, namely, as of PowerShell 7.2.x, one based on Newtonsoft.JSON (whose direct use is shown in r3verse's answer). There, your sample roundtrip command works as expected.

Only ConvertTo-Json in Windows PowerShell is affected (the bundled-with-Windows PowerShell edition whose latest and final version is 5.1). But note that the JSON representation - while unexpected - is technically correct.

A simple, but robust solution focused only on unescaping those Unicode escape sequences that ConvertTo-Json unexpectedly creates - namely for & ' < > - while ruling out false positives:

# The following sample JSON with undesired Unicode escape sequences for `& < > '`
# was created with Windows PowerShell's ConvertTo-Json as follows:
#   ConvertTo-Json "Ten o'clock at <night> & later. \u0027 \\u0027"
$json = '"Ten o\u0027clock at \u003cnight\u003e \u0026 later. \\u0027 \\\\u0027"'

[regex]::replace(
  $json, 
  '(?<=(?:^|[^\\])(?:\\\\)*)\\u(00(?:26|27|3c|3e))', 
  { param($match) [char] [int] ('0x' + $match.Groups[1].Value) },
  'IgnoreCase'
)

The above outputs the desired JSON representation, without the unnecessary escaping of &, ', <, and >, and without having falsely replaced the escaped substrings \\u0027 and \\\\u0027:

"Ten o'clock at <night> & later. \\u0027 \\\\u0027"
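
For reuse in a pipeline, the replacement above could be wrapped in a small helper. This is only a minimal sketch: the function name Repair-JsonUnicodeEscape is hypothetical (it is not a built-in cmdlet), and the regex is the one shown above, unchanged.

```powershell
# Hypothetical helper; the name is an assumption, not part of PowerShell.
function Repair-JsonUnicodeEscape {
  param(
    [Parameter(Mandatory, ValueFromPipeline)]
    [string] $Json
  )
  process {
    # Unescape only \u0026, \u0027, \u003c and \u003e, and only when the
    # backslash is not itself escaped by a preceding backslash.
    [regex]::Replace(
      $Json,
      '(?<=(?:^|[^\\])(?:\\\\)*)\\u(00(?:26|27|3c|3e))',
      { param($match) [string] [char] [int] ('0x' + $match.Groups[1].Value) },
      [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
    )
  }
}

# JSON text with an escape sequence of the kind Windows PowerShell's
# ConvertTo-Json emits.
$escaped = '{ "something": "http://domain/?x=1\u0026y=2" }'
$escaped | Repair-JsonUnicodeEscape
# { "something": "http://domain/?x=1&y=2" }
```

With such a helper in place, the roundtrip from the question would read $fileContent | ConvertFrom-Json | ConvertTo-Json | Repair-JsonUnicodeEscape.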

Background information:

ConvertTo-Json in Windows PowerShell unexpectedly represents the following ASCII-range characters by their Unicode escape sequences in JSON strings:

  • & (Unicode escape sequence: \u0026)
  • ' (\u0027)
  • < and > (\u003c and \u003e)

There's no good reason to do so (these characters only require escaping in HTML/XML text).

However, any compliant JSON parser - including ConvertFrom-Json - converts these escape sequences back to the characters they represent.

In other words: While the JSON text created by Windows PowerShell's ConvertTo-Json is unexpected and can impede readability, it is technically correct and - while not identical - equivalent to the original representation in terms of the data it represents.
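
The equivalence can be checked by parsing both representations and comparing the resulting values rather than the JSON text. A minimal sketch, using a shortened version of the question's sample (the variable names are only illustrative):

```powershell
# The same value, once with the Unicode escape sequence that Windows
# PowerShell's ConvertTo-Json produces and once with the verbatim character.
$escaped  = '{ "something": "http://domain/?x=1\u0026y=2" }'
$verbatim = '{ "something": "http://domain/?x=1&y=2" }'

$a = ($escaped  | ConvertFrom-Json).something
$b = ($verbatim | ConvertFrom-Json).something

# Ordinal comparison, so that no culture-specific rules interfere.
[string]::Equals($a, $b, [StringComparison]::Ordinal)   # True
```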


Fixing the readability problem:

As an aside: While [regex]::Unescape(), whose purpose is to unescape regexes only, also converts Unicode escape sequences to the characters they represent, it is fundamentally unsuited to selectively unescaping Unicode sequences in JSON strings, given that all other \ escapes must be preserved in order for the JSON string to remain syntactically valid.
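
A short sketch of how this goes wrong in practice; the sample value is made up for illustration. [regex]::Unescape() turns \\ into a single \, so the following t is then read as the JSON escape \t (a tab) when the text is parsed again:

```powershell
# Illustrative JSON: an escaped backslash followed by 't', plus \u0026.
$json = '{ "path": "C:\\temp \u0026 more" }'

# Parsing the original JSON yields the intended value: C:\temp & more
$intended = ($json | ConvertFrom-Json).path

# Unescape() also removes the JSON escaping of the backslash:
#   { "path": "C:\temp & more" }
$mangled = [regex]::Unescape($json)

# On re-parsing, \t is now a tab character: C:<TAB>emp & more
$parsed = ($mangled | ConvertFrom-Json).path

[int] $intended[2]   # 92 (backslash)
[int] $parsed[2]     # 9  (tab)
```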

While your answer works well in general, it has limitations (aside from the easily corrected problem that a-zA-Z should be a-fA-F to limit matching to those letters that are valid hex. digits):

  • It doesn't rule out false positives, such as \\u0027 or \\\\u0027 (\\ escapes \, so that the u0027 part becomes a verbatim string and must not be treated as an escape sequence).

  • It converts all Unicode escape sequences, which presents two problems:

    • Escape sequences representing characters that require escaping would also be converted to the verbatim character representations, which would break the JSON representations with \u005c, for instance, given that the character it represents, \, requires escaping.

    • For non-BMP Unicode characters that must be represented as pairs of Unicode escape sequences (so-called surrogate pairs), your solution would mistakenly try to unescape each half of the pair separately.
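
The first of these two problems can be sketched as follows. The function name Convert-AllUnicodeEscape is hypothetical and stands in for the convert-everything approach (with the hex-digit correction already applied); the sample value is made up for illustration.

```powershell
# Hypothetical stand-in for an approach that unescapes ALL \uXXXX sequences.
function Convert-AllUnicodeEscape {
  param([string] $Json)
  [regex]::Replace(
    $Json,
    '\\u([0-9a-fA-F]{4})',
    { param($m) [string] [char] [int] ('0x' + $m.Groups[1].Value) }
  )
}

# \u005c represents a backslash, which requires escaping in JSON.
$json = '{ "sep": "a\u005cb" }'

# The conversion yields { "sep": "a\b" }, in which \b is a JSON escape
# for a backspace character.
$converted = Convert-AllUnicodeEscape $json

$before = ($json      | ConvertFrom-Json).sep   # a\b
$after  = ($converted | ConvertFrom-Json).sep   # a<BACKSPACE>b

[int] $before[1]   # 92 (backslash)
[int] $after[1]    # 8  (backspace)
```

The converted text is still syntactically valid JSON here, but it no longer represents the same data.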

For a robust solution that overcomes these limitations, see this answer (surrogate pairs are left as Unicode escape sequences, Unicode escape sequences whose characters require escaping are converted to \-based (C-style) escapes, such as \n, if possible).

However, if the only requirement is to unescape those Unicode escape sequences that Windows PowerShell's ConvertTo-Json unexpectedly creates, the solution at the top is sufficient.

mklement0