
I have a problem getting data with Polish diacritics from `Invoke-WebRequest` or `Invoke-RestMethod`. When I retrieve the data I get strange characters instead of the correct Polish diacritics, for example: `Plec : MÄżczyzna`

When I try the same web request in postman I get the correct diacritics: "Plec": "Mężczyzna",

When I copy the PowerShell script that Postman generates, I still do not get the correct diacritics. I have added this to the body:

`$body = [System.Text.Encoding]::UTF8.GetBytes($body)`

And also changed the headers to:

`$headers = @{
            "Content-Type"="application/json; charset=utf-8";
            "OData-MaxVersion"="4.0";
            "OData-Version"="4.0";
        };`

This is the request:

`$response = Invoke-RestMethod 'https://<URL>/api/MethodInvoker/InvokeServiceMethod' -Method 'POST' -Headers $headers -Body $body`

I have tried Postman, several different encodings, changing the headers, etc.

Hans Aben
  • In short: In the absence of `charset` information in the _response_ header, _Windows PowerShell_ defaults to ISO-8859-1 character encoding. _PowerShell (Core) 7.0 - 7.3.x_ defaults to UTF-8 _for JSON only_; v7.4+ will _generally_ default to UTF-8. If the de-facto encoding differs (e.g., UTF-8) and you have no way of getting the service to include `charset` information in its response, you'll have to _manually_ decode the response using the encoding of choice, based on the response body's _raw bytes_, via `Invoke-WebRequest`. See the linked duplicates for details. – mklement0 May 04 '23 at 17:14

1 Answer


tl;dr

This normally indicates the server is encoding the content into a response byte stream in one format (e.g. utf8) but the client is decoding the byte stream using a different format (e.g. iso-8859-1). As a result, the content decoded by the client doesn't match the original content encoded by the server.

This snippet shows the effect in action:

$originalContent = "Mężczyzna";

# encode with utf8
$encodedBytes = [System.Text.Encoding]::UTF8.GetBytes($originalContent);

# decode with iso-8859-1
$decodedContent = [System.Text.Encoding]::GetEncoding("iso-8859-1").GetString($encodedBytes)

$decodedContent
# MÄżczyzna
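
In this particular case the mangling happens to be reversible in memory, because .NET's iso-8859-1 encoding maps every one of the 256 byte values to a character, so no information is lost. Continuing from the snippet above, you can re-encode the mangled string to get the original raw bytes back, then decode those bytes with the encoding the server actually used:

```powershell
# re-encode the mangled string with iso-8859-1 to recover the original raw bytes...
$recoveredBytes = [System.Text.Encoding]::GetEncoding("iso-8859-1").GetBytes($decodedContent);

# ...and decode them with the encoding the server actually used
$recoveredContent = [System.Text.Encoding]::UTF8.GetString($recoveredBytes);

$recoveredContent
# Mężczyzna
```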

Unfortunately, reversing the process isn't 100% guaranteed to work - some mis-decodings are lossy, so you can't always recover the original content by undoing the decoding and encoding steps. However, if you write the response to disk, PowerShell will stream the raw response bytes into a file without decoding them, and you can then read the file back using the server's actual encoding to recover the original content:

$filename = "c:\temp\response.txt";

Invoke-RestMethod `
    -Uri     "https://<URL>/api/MethodInvoker/InvokeServiceMethod" `
    -Method  "POST" `
    -Headers $headers `
    -Body    $body `
    -OutFile $filename;
#   ^^^^^^^^ ^^^^^^^^^
#   write the raw byte stream to disk without (mis-)decoding it
#   (note: with -OutFile, Invoke-RestMethod returns nothing to the pipeline)

# read the file back using the encoding the server actually used
$text = Get-Content $filename -Encoding UTF8
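
Alternatively, if you'd rather avoid the temporary file (an approach mklement0 also mentions in the comments), `Invoke-WebRequest` exposes the undecoded response body as a stream, and you can decode the raw bytes yourself. This is a sketch using the placeholder URL from the question:

```powershell
# Invoke-WebRequest keeps the undecoded response body available via RawContentStream
$response = Invoke-WebRequest `
    -Uri     "https://<URL>/api/MethodInvoker/InvokeServiceMethod" `
    -Method  "POST" `
    -Headers $headers `
    -Body    $body;

# decode the raw bytes with the encoding the server actually used (utf8 here)
$text = [System.Text.Encoding]::UTF8.GetString($response.RawContentStream.ToArray());
```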

More Details

The root problem seems to be caused by different interpretations of what the default encoding should be for some content types - for example:

Content-Type: application/json

Some systems (including Windows PowerShell) appear to use an older heuristic that assumes content is encoded using iso-8859-1 unless a charset optional parameter is specified on the content type - see RFC2616: Hypertext Transfer Protocol -- HTTP/1.1

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.

For example if Windows PowerShell receives a response with this header:

Content-Type: application/json

it will treat it like:

Content-Type: application/json;charset=iso-8859-1

whereas if the response contains this header:

Content-Type: application/json;charset=utf-8

Windows PowerShell will use utf8 to decode it instead.

This interpretation was superseded in RFC7231: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content where it says:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says.

and since the spec for RFC8259: The JavaScript Object Notation (JSON) Data Interchange Format says:

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

that's what some clients do, so for those systems this:

Content-Type: application/json

is treated like

Content-Type: application/json;charset=utf-8

and they use utf8 even if no charset is specified.
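
For what it's worth, you can see how the optional `charset` parameter is carried on the header by parsing it with .NET's `System.Net.Mime.ContentType` class - this is just to illustrate the header format, not necessarily what the cmdlets use internally:

```powershell
# a content type with an explicit charset parameter
$explicit = [System.Net.Mime.ContentType]::new("application/json; charset=utf-8");
$explicit.MediaType  # application/json
$explicit.CharSet    # utf-8

# with no charset parameter, CharSet is null and
# the client has to pick its own default encoding
$bare = [System.Net.Mime.ContentType]::new("application/json");
$bare.CharSet
```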

You could fix the original issue by getting the owner of the website / api to add the charset=utf-8 optional parameter to the content-type header. That would improve interoperability with some clients, but it isn't strictly required by the various specs, and it may not be straightforward to get applied if the site is owned by a third party.

And based on the above, the reason the Content-Type: application/json response header works in Postman is probably that it uses the newer interpretation of the specs and assumes utf8 encoding for application/json, whereas Windows PowerShell is using the older interpretation of iso-8859-1 encoding.

For reference, this GitHub issue was the key to understanding all of this behaviour.

Finally...

...if you want a script to help debug these sorts of issues in future, I wrote one a while ago in this answer - https://stackoverflow.com/a/67182420/3156906. It takes the original text and the mangled text and tries to work out what pair of mismatched encodings were used to mangle the text. When I ran it with your text it gave me this:

original string = 'Mężczyzna'
mangled string  = 'MÄżczyzna'
    source encoding = 'utf-8'
    target encoding = 'iso-8859-1'
original string = 'Mężczyzna'
mangled string  = 'MÄżczyzna'
    source encoding = 'utf-8'
    target encoding = 'iso-8859-13'
original string = 'Mężczyzna'
mangled string  = 'MÄżczyzna'
    source encoding = 'utf-8'
    target encoding = 'iso-8859-9'
mclayton
  • The state of things in relation to PowerShell editions / versions: v7+ defaults to UTF-8 for JSON only, v7.4+ will be UTF-8 for all media types. Since ISO-8859-1 can represent _all_ 8-bit values, re-encoding of mis-decoded strings should work robustly. The first linked duplicate shows an alternative based on decoding the _raw byte stream_ directly (come to think of it: not sure about performance implications, but at least no temporary file is required). – mklement0 May 04 '23 at 17:34
  • Saying that it isn't PowerShell's fault may be a bit too generous, at least in the case of JSON: the [RFC](https://www.ietf.org/rfc/rfc4627.txt) is from 2006 and stipulates UTF-8 as the default, and generally limits support to UTF-8, UTF-16, or UTF-32. – mklement0 May 04 '23 at 17:37
  • @mklement0 - sure, but the HTTP protocol RFC (/me handwaves away the details:-)) also says the default encoding in lieu of any overrides is “iso-8859-1”, so I’d argue (maybe a bit pedantically, tbf) the HTTP response being sent from the server is technically the root cause as it’s encoding in one format but declaring another (or more likely, not declaring *any*). Powershell is processing the response “correctly” according to the HTTP spec, but the encoded response doesn’t actually represent the original unencoded data. – mclayton May 04 '23 at 18:11
  • But doesn't the use of a specific media type with _its own_ default override the (now obsolete) general ISO-8859-1 HTTP default? – mklement0 May 04 '23 at 18:14
  • Granted, the PowerShell cmdlets, as generalists, would have to know all media types to honor specific defaults (clearly, JSON was accommodated explicitly in 7.0), but this will be moot in v7.4+ when we'll live in happy UTF-8-always times. – mklement0 May 04 '23 at 18:18
  • Well that’s a good question :-)… – mclayton May 04 '23 at 18:19
  • @mklement0 - I’ve not checked the links, but this seems like they know what they’re talking about… https://github.com/dart-lang/http/issues/175#issuecomment-619593544. I’ll refine my answer after I’ve finished my kids bedtime… – mclayton May 04 '23 at 18:45
  • Good info, thanks; there's also a dedicated RFC: [RFC 6657](https://tools.ietf.org/html/rfc6657); if I read it correctly, it says: Each media type should define its own behavior, and either _mandate_ a `charset` attribute or _not use it_, which assumes in-band specification (such as an XML document's `encoding` attribute in the XML declaration). If there's "a strong reason" to define a default nonetheless, UTF-8 should be used. – mklement0 May 04 '23 at 20:44
  • @mklement0 - my "refining" ended up being a fairly big rewrite... – mclayton May 04 '23 at 23:50