I have a PowerShell script that pulls data from an API using the command below:

Invoke-RestMethod -Method Post -Uri $WebServiceURL -Body $json -ContentType "Application/json" 

The data on the API server contains an Em Dash "–".

When I pull the data using Postman, it displays the Em Dash as it is, but as soon as I pull the data using PowerShell and print the output, it displays some weird characters, as below.

OUPath=ABCD.COM/Test/All Users/India/Test/TestâOU/Desktop Users

Em Dash is printed as "â".

I tried changing the output encoding of PowerShell using the command below, but no luck.

[Console]::OutputEncoding = [Text.Encoding]::Utf8

Current PowerShell version details:

PS Codes> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      5.1.19041.1
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.19041.1
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1

Default output encoding, as below:

PS Codes> [Console]::OutputEncoding


IsSingleByte      : True
BodyName          : IBM437
EncodingName      : OEM United States
HeaderName        : IBM437
WebName           : IBM437
WindowsCodePage   : 1252
IsBrowserDisplay  : False
IsBrowserSave     : False
IsMailNewsDisplay : False
IsMailNewsSave    : False
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : True
CodePage          : 437
Sujeet Padhi
  • Could you try with `-ContentType 'Application/json; charset=utf8'` in `Invoke-RestMethod` ? – Zilog80 Apr 20 '21 at 09:39
  • @Zilog80 : No Luck. I have already tried it. – Sujeet Padhi Apr 20 '21 at 10:07
  • FYI, "–" is `U+2013` _En Dash_ while `U+2014` _Em Dash_ is "—"… – JosefZ Apr 20 '21 at 15:17
  • In any case, `'\u2013\u2014'.encode('utf-8').decode('latin1','ignore')` returns `'â\x80\x93â\x80\x94'` and non-printable characters `\x80`, `\x93` and `\x94` are choked down so that you see merely `â` (try yourself: `print('\u2013\u2014'.encode('utf-8').decode('latin1','ignore'))`). Examples in Python as it's more intuitive and clear than the same functionality in Posh… – JosefZ Apr 20 '21 at 15:34

1 Answer

I've answered a couple of similar questions here in the past (see https://stackoverflow.com/a/58542493/3156906 and https://stackoverflow.com/a/66404671/3156906), so I think the problem is a combination of a few things:

  • The server is sending a response encoded as utf-8, but not adding a charset parameter in the Content-Type header
  • In the absence of a charset, Windows PowerShell is following the (older) HTTP spec default and decoding as ISO-8859-1, which ends up with a mangled string that you're writing out to the console verbatim
  • Postman is possibly detecting somehow that the response is utf-8 even though there's no charset, and is decoding the response stream fine

Of course, if there is a charset parameter then the rest of this answer is nonsense!
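One way to check that assumption is to look at the Content-Type header the server actually returns. The sketch below is only illustrative: `Get-DeclaredCharset` is a hypothetical helper name (not a built-in cmdlet), and the commented usage assumes the `$WebServiceURL` and `$json` variables from the question.

```powershell
# Hypothetical helper (not part of PowerShell): extracts the charset parameter
# from a Content-Type header value, or returns $null when none is declared.
function Get-DeclaredCharset {
    param([string]$ContentType)
    if ($ContentType -match 'charset\s*=\s*"?([^;"\s]+)') {
        return $Matches[1]
    }
    return $null
}

# Usage against the API from the question ($WebServiceURL and $json as defined there).
# Invoke-WebRequest is used because its result exposes the response headers:
# $response = Invoke-WebRequest -Method Post -Uri $WebServiceURL -Body $json `
#     -ContentType 'application/json' -UseBasicParsing
# Get-DeclaredCharset ([string]$response.Headers['Content-Type'])
```

If the helper returns nothing for the real response, the client is left to guess the encoding, which is the situation described above.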

Anyway, here's a simple bit of script to reproduce the issue:

# server encodes response text using utf8
PS> $text = [string][char]0x2014; # em dash (the "`u{2014}" escape only works in PowerShell 6+)
PS> $bytes = [System.Text.Encoding]::UTF8.GetBytes($text);
PS> write-host $bytes;
226 128 148

# client (Invoke-RestMethod) decodes bytes as ISO-8859-1
PS> $text = [System.Text.Encoding]::GetEncoding("ISO-8859-1").GetString($bytes);
PS> write-host $text;
â

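Unfortunately, in your case the mangling may not be reversible from the printed text: as @JosefZ noted in the comments, the other two characters of each mangled sequence are non-printable and get "choked down" (i.e. hidden or discarded) on output. The string still held in memory may be a different story, since an ISO-8859-1 decode maps every byte to a character, so it is worth inspecting its code points before giving up on it.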
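To check whether any bytes were really discarded in a given case, one option is to list the code points of the decoded string. The sketch below simulates the mangling locally rather than calling the API; in a real session `$mangled` would be the property value returned by `Invoke-RestMethod`, and the decode is assumed to have been ISO-8859-1.

```powershell
# Simulate the mangling: UTF-8 bytes of an em dash decoded as ISO-8859-1.
$latin1  = [System.Text.Encoding]::GetEncoding('ISO-8859-1')
$utf8    = [System.Text.Encoding]::UTF8
$mangled = $latin1.GetString($utf8.GetBytes([string][char]0x2014))

# List the code points actually present in the string, including invisible ones.
$codePoints = ($mangled.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int]$_ }) -join ' '
Write-Host $codePoints

# If every byte survived as a character, reversing the decode recovers the text.
$restored = $utf8.GetString($latin1.GetBytes($mangled))
Write-Host ($restored -eq [string][char]0x2014)
```

If the listing shows only a single code point for the real value, the other characters were lost somewhere along the way and this round trip will not help.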

All I can really suggest is:

  • Fix the API (if you have access) so it sends a "charset=utf-8" parameter or,
  • Maybe hard-code some special handling to fix up known bad names before downstream processing takes place
  • Alternatively, use the -OutFile parameter for Invoke-RestMethod to write the response bytes into a file without decoding them, and then read that back in as a utf-8 encoded file.
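The -OutFile suggestion could look roughly like the sketch below. It assumes the response body is UTF-8 encoded JSON; `Read-Utf8Json` and `Invoke-RestMethodUtf8` are hypothetical helper names introduced here for illustration, and the commented usage relies on the `$WebServiceURL` and `$json` variables from the question.

```powershell
# Hypothetical helper: reads a file as UTF-8 text and parses it as JSON.
function Read-Utf8Json {
    param([Parameter(Mandatory)][string]$Path)
    $text = [System.IO.File]::ReadAllText($Path, [System.Text.Encoding]::UTF8)
    return $text | ConvertFrom-Json
}

# Hypothetical wrapper: saves the raw response to a temp file, then decodes it as UTF-8.
function Invoke-RestMethodUtf8 {
    param(
        [Parameter(Mandatory)][string]$Uri,
        [Parameter(Mandatory)][string]$Body
    )
    $tempFile = [System.IO.Path]::GetTempFileName()
    try {
        # -OutFile writes the response bytes to disk without decoding them to a string.
        Invoke-RestMethod -Method Post -Uri $Uri -Body $Body -ContentType 'application/json' -OutFile $tempFile
        Read-Utf8Json -Path $tempFile
    }
    finally {
        # Clean up the temp file whether or not the request succeeded.
        Remove-Item -LiteralPath $tempFile -ErrorAction SilentlyContinue
    }
}

# Usage with the variables from the question:
# $data = Invoke-RestMethodUtf8 -Uri $WebServiceURL -Body $json
```

Going through a temp file is clumsy, but it sidesteps the client's charset guess entirely because the decoding step is done explicitly.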

Incidentally, here's a script I've used previously to detect what encoding/decoding pair results in a given mangling - I've written it from scratch each time so I might as well post it here this time so I can find it again later :-).


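# [char] casts are used because the "`u{...}" escape only works in PowerShell 6+
$original = [string][char]0x2014; # em dash
$mangled  = [string][char]0x00E2; # circumflex a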

$encodings = [System.Text.Encoding]::GetEncodings() | sort-object -Property "Name";
foreach( $source in $encodings )
{
    foreach( $target in $encodings )
    {
        $bytes = [System.Text.Encoding]::GetEncoding($source.Name).GetBytes($original);
        $text  = [System.Text.Encoding]::GetEncoding($target.Name).GetString($bytes);
        if( $text -eq $mangled )
        {
            write-host "original string = '$original'";
            write-host "mangled string  = '$mangled'";
            write-host "    source encoding = '$($source.Name)'";
            write-host "    target encoding = '$($target.Name)'";
        }
    }
}
mclayton