2

Here's the situation:

  • I have a UiPath process containing an Invoke Power Shell activity
  • In the Invoke Power Shell activity (with IsScript set), my script is `python '/path/to/my/script.py'`. The output is saved as a string

When I run my "script.py" file in any PowerShell console on my computer, the output I get is "Cédric", but when I run the script through UiPath, the output I get is "CÚdric". I understand that the issue is somehow related to the encoding.

After some research, I found out that running the PowerShell command `[System.Text.Encoding]::Default.EncodingName` gives different results:

  • In my system PowerShell: "Western European (Windows)"
  • In UiPath's PowerShell: "Unicode (UTF-8)"

I found out that the hex code of "é" is E9 in the Windows-1252 encoding, but in the CP850 encoding, E9 is "Ú". So I guess this is the encoding relation I'm looking for. However, I tried many things in UiPath (C#) and in PowerShell commands, and nothing resolved my problem (I tried both changing encoding values and converting the string into bytes to change the output encoding).
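
For reference, this relation can be checked directly in PowerShell by decoding the same byte with both code pages (a minimal demonstration):

    # Decode the single byte 0xE9 with two different code pages:
    [System.Text.Encoding]::GetEncoding(1252).GetString([byte[]] 0xE9)  # -> é (Windows-1252)
    [System.Text.Encoding]::GetEncoding(850).GetString([byte[]] 0xE9)   # -> Ú (CP850)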

And to anticipate some questions:

  • No, I won't use the "Invoke Python Script" activity in UiPath, as it's broken
  • Yes, I need to use this Python script
  • Yes, I could use a `replace("Ú","é")` on the string output, BUT I don't want to do that blindly for every special character that could come up, especially when there's a logical reason behind the issue

TL;DR: Basically, the issue arises when UiPath interprets the output of the PowerShell console running the Python script.

I've been stuck on this for 3 days now, only to gain 2% more precision on the project I'm working on (which is completely fine otherwise); so it's not worth the time I'm spending on it, but I need to know.

  • Switch to UTF-8 everywhere in your script and shells, and you should have consistent results. Add a piece of code we can try if you want more help. See [MRE]. The Python script's first line should look like: `# -*- coding: utf-8 -*-` – Malo May 11 '23 at 12:39
  • You are using the wrong Font. The font you are using doesn't contain the character you are using. – jdweng May 11 '23 at 13:42
  • @jdweng: _Character-encoding mismatches_, such as described in this question, are unrelated to _fonts_. The font would only be a problem if a character didn't _render_ as such, using a _placeholder_ (`? `) that signals that the font doesn't include a glyph for the character's code point. Please consider no longer posting comments that conflate these two unrelated aspects, so as not to confuse others. – mklement0 May 11 '23 at 15:23
  • @mklement0: Both characters are 0xE9. There is no way of telling the difference except by the Font that is used. If the source was a text file with UTF-8 Encoding, then the Font determines the way the code is displayed. There is a glyph in the font, so it doesn't display the question mark (it displays the wrong character). – jdweng May 11 '23 at 15:59
  • @jdweng: There's only _one byte_ at issue here, with value `0xE9`. How PowerShell _translates this byte value into a .NET `[char]` instance (a UTF-16 code unit)_ depends on the value of `[Console]::OutputEncoding`. Python outputs byte `0xE9` meaning it to be character `é`, due to using Windows-1252 encoding. PowerShell _misinterprets_ this byte as referring to character `Ú`, because it decodes the byte as CP850. It is purely this encoding mismatch, resulting in _different characters_, that causes the problem, which has nothing to do with fonts. – mklement0 May 11 '23 at 16:28
  • @jdweng: To spell out the misinterpretation: `[Text.Encoding]::GetEncoding(1252).GetString([byte[]] 0xe9)` (-> `é`) vs. `[Text.Encoding]::GetEncoding(850).GetString([byte[]] 0xe9)` (-> `Ú`) – mklement0 May 11 '23 at 16:32
  • @mklement0: That is only if you are using Windows Encoding. If you have UTF-8 and output data, it will use the font of the app. – jdweng May 11 '23 at 17:11
  • @jdweng: honestly, I don't even know how to interpret your comment. A given (Unicode) _character_ (as opposed to a _byte or byte sequence_ that must first be _interpreted as_ a character and can therefore be interpreted differently) always renders _the same_ (though possibly not as itself, if the font doesn't have a glyph for it). If you think that any of my arguments are incorrect, dispute them specifically - your comment doesn't do that. – mklement0 May 11 '23 at 18:13
  • @mklement0: A character in Net/Core is a class object that can be one or two bytes, along with a property indicating if the character is one or two bytes. A string is an array of characters. When transmitting (or reading from files) characters/strings, you are sending bytes, not the class object. The OP is starting with bytes (not char/string) and is using PS (not necessarily Net/Core). So if the OP is displaying bytes, then encoding does not apply. Suppose the OP opens the bytes in Notepad (using UTF-8), what will he see? It will depend on the Font that is used. – jdweng May 12 '23 at 05:03
  • @jdweng: [`System.Char`](https://learn.microsoft.com/en-US/dotnet/api/System.Char) is a value type, not a class, and its instances are unsigned 16-bit integer values that constitute UTF-16 code units - a .NET character isn't made up of one or two bytes, and it has no properties (let alone one that indicates a byte count, which would be meaningless). Fonts render _characters_, not _bytes_. – mklement0 May 12 '23 at 12:16
  • @jdweng: An external process' (or file's) _byte stream_ may or may not encode characters (text); here it does: it is the Windows-1252 encoding of the string `Cédric`. For this byte stream to be retransformed into the original string, the _same_ encoding must be used to decode it, and because PowerShell uses a _different_ encoding, a _different string_ results. That is the only problem here. – mklement0 May 12 '23 at 12:17
  • @jdweng: There is no way to display _bytes_ as such - you can either choose to show their _values_ or you can interpret them _as characters_, which is what text editors such as Notepad do. Interpretation as characters requires choosing an encoding (based on a default or the presence of a BOM). For a given encoding, the characters render the same, regardless of what font is used (leaving incidental variations in appearance aside). – mklement0 May 12 '23 at 12:18
  • @mklement0: Microsoft has brainwashed you into thinking everything needs encoding. I'm from the Unix world, which knows how to separate layers. Yes, when you are looking at this issue as 16-bit data where everything is a Unicode character, it is encoding. But when you look at it from the real world of bytes, it is a Font issue. If you sent byte data to a printer that is using one Font, the character would look one way; then you change the Font on the printer and the character would look different. The bytes do not change. – jdweng May 13 '23 at 07:54
  • @mklement0: Encoding was designed to save bytes by mapping 8-bit characters to 16-bit characters. The ASCII characters 0x80 to 0xFF are mapped differently depending on the encoding, so you only need to represent Unicode characters with 8 bits, but you do not get every Unicode character, just 64 of the characters with each type of encoding. – jdweng May 13 '23 at 07:57
  • @jdweng: The discussion focused on Windows, but it applies to Unix as well. Summary: In modern OSs, it is solely the _character encoding_ that determines how raw byte sequences are interpreted as [_Unicode code points_](https://en.wikipedia.org/wiki/Unicode#Codespace_and_Code_Points) (characters), and the choice of _font_ only determines _stylistic variations_ of those characters. The only exceptions are _symbol fonts_, such as [Wingdings](https://en.wikipedia.org/wiki/Wingdings) on Windows, which redefine the meaning of code points in order to be able to render specialty symbols. – mklement0 May 13 '23 at 17:58
  • @jdweng: I can't make sense of your comments about encoding. Again: if you find fault with any of my arguments, you need to address them specifically. Also, please refrain from incendiary phrases such as "brainwashed" and ad-hominem arguments in general. I strongly encourage you to spend more time familiarizing yourself with the fundamentals before posting future comments - you're doing the community a real disservice. – mklement0 May 13 '23 at 18:01
  • @mklement0: You are brainwashed by Microsoft. I went to graduate school where most of the Professors were from Bell Labs. They were Scientists, not the simple programmers at Microsoft. A console in Windows has a Font option. And if you change the Font and open a UTF-8 encoded file, the characters 0x80 to 0xFF will appear different depending on the Font. A German, Italian, or French Font will each show a different character for the same byte. A byte is a byte and the Font determines how it appears. There is no need for encoding except in the Microsoft World. – jdweng May 13 '23 at 18:57
  • @jdweng: There is no such thing as a "German, Italian, French font". Encoding applies to _every_ platform: deciding what _character_ a byte value or sequence represents can only be done on the basis of a _character encoding_ (aka _codeset_, _charmap_). Try it for yourself: Run `'hü' | Set-Content -Encoding utf8 test.txt; notepad test.txt` and then change to any other (non-symbol) font - you'll see differences in _styling_, but the characters will remain the same. What schooling you had and who your professors were is irrelevant. Stop the ad-hominem arguments. – mklement0 May 13 '23 at 19:45

1 Answer

4

As for [System.Text.Encoding]::Default: That you're seeing UTF-8 as the value in UiPath implies that it is using PowerShell (Core) 7+ (pwsh.exe), the modern, install-on-demand, cross-platform edition built on .NET 5+, whereas Windows PowerShell (powershell.exe), the legacy, ships-with-Windows, Windows-only edition is built on .NET Framework.
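
To confirm which edition UiPath is launching and what it uses to decode external output, here is a minimal check you could run inside the Invoke Power Shell activity:

    # 'Core' -> PowerShell (Core) 7+; 'Desktop' -> Windows PowerShell 5.1
    $PSVersionTable.PSEdition
    # UTF-8 in PowerShell (Core) 7+; the active ANSI code page in Windows PowerShell
    [System.Text.Encoding]::Default.EncodingName
    # The encoding used to decode external-program output (typically the OEM code page, e.g. 850)
    [Console]::OutputEncoding.EncodingName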

  • PowerShell honors the system's active legacy OEM code page by default when interpreting output from external programs (such as Python scripts),[1] e.g. 850, as reported by chcp, and as reflected in [Console]::OutputEncoding from inside PowerShell.

    • That is, PowerShell interprets the byte stream received from external programs as text encoded according to [Console]::OutputEncoding, and decodes it that way, resulting in a Unicode in-memory string representation, given that PowerShell is built on .NET whose strings are composed of UTF-16 Unicode code units ([char]). If [Console]::OutputEncoding doesn't match the actual encoding that the external program uses, misinterpreted text can be the result, as in your case.[2]

      • Note: This interpretation only comes into play when PowerShell either captures or redirects output from an external program. Otherwise, the output prints directly to the console, and the problem may not surface there. For example, running python script.py results in Cédric printing to the console, but python script.py | Write-Output - due to use of a pipeline - involves interpretation by PowerShell, and the encoding mismatch would result in CÚdric.
    • A UTF-8 opt-in is available:

      • Execute the following in PowerShell, before calling the Python script (see this answer for background information):

        $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
        
  • Python, by contrast, defaults to the system's active legacy ANSI code page (e.g. Windows-1252).[3]

    • A UTF-8 opt-in is available, either:

      • By defining environment variable PYTHONUTF8 with value 1: Before calling your Python script, execute $env:PYTHONUTF8=1 in PowerShell.

      • Or, in Python 3.7+, with explicit python CLI calls, by using the -X utf8 option (case matters); a combined sketch of both opt-ins follows below.
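
Putting the two opt-ins together, a minimal sketch of what the UiPath-invoked script could run before the Python call (the script path is a placeholder):

    # Make PowerShell encode input to and decode output from external programs as UTF-8
    $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
    # Make Python encode its stdout / stderr as UTF-8
    $env:PYTHONUTF8 = 1
    # Both sides now agree on UTF-8, so "Cédric" round-trips intact
    python '/path/to/my/script.py'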

Note:

  • Given the above - assuming that your Python script only ever outputs characters that are part of the Windows-1252 code page - the alternative is to leave Python at its defaults and (temporarily) set the console encoding to Windows-1252 instead of UTF-8 (a save-and-restore variant follows the snippet):

    $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252)
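
    Since this is meant to be temporary, here is a minimal save-and-restore sketch around the call (the script path is a placeholder):

        # Remember the current console encodings ...
        $prevOut = [Console]::OutputEncoding
        $prevIn  = [Console]::InputEncoding
        try {
            # ... switch to Windows-1252 for the duration of the call ...
            $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252)
            python '/path/to/my/script.py'
        }
        finally {
            # ... and restore them afterwards.
            [Console]::OutputEncoding = $prevOut
            [Console]::InputEncoding  = $prevIn
        }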
    

There is an option to NOT require this configuration, by configuring Windows to use UTF-8 system-wide, as described in this answer, which sets both the active OEM and the active ANSI code page to 65001, i.e. UTF-8.

  • Caveat: This feature - still in beta as of Windows 11 22H2 - has far-reaching consequences:

    • It causes preexisting, BOM-less files encoded based on the culture-specific ANSI code page (e.g. Windows-1252) to be misinterpreted by default by Windows PowerShell, Python, and generally all non-Unicode Windows applications.

    • Note that .NET applications, including PowerShell (Core) 7+ (but not Windows PowerShell),[1] have the inverse problem, which they must deal with irrespective of this setting: because they assume that a BOM-less file is UTF-8-encoded, they must specify the culture-specific legacy ANSI code page explicitly when reading such files, as sketched below.
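
      For example, a minimal sketch of reading such a preexisting ANSI file (legacy.txt is a hypothetical name, assumed to be Windows-1252-encoded and BOM-less):

        # Windows PowerShell 5.1: -Encoding Default means the active ANSI code page
        Get-Content -Encoding Default legacy.txt
        # PowerShell (Core) 7+: the ANSI code page must be requested explicitly
        # (numeric code-page IDs are supported since PowerShell 6.2)
        Get-Content -Encoding 1252 legacy.txt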


[1] PowerShell-native commands and scripts, which run in-process, consistently communicate text via in-memory Unicode strings, due to using .NET strings, so no encoding problems can arise.
When it comes to reading files, Windows PowerShell defaults to the ANSI code page when reading source code and text files with Get-Content, whereas PowerShell (Core) 7+ now - commendably - consistently defaults to UTF-8, also with respect to what encoding is used to write files - see this answer for more information.

[2] Specifically, Python outputs byte 0xE9 meaning it to be character é, due to using Windows-1252 encoding. PowerShell misinterprets this byte as referring to character Ú, because it decodes the byte as CP850, as reflected in [Console]::OutputEncoding. Compare [Text.Encoding]::GetEncoding(1252).GetString([byte[]] 0xE9) (-> é, whose Unicode code point is 0xE9 too, because Unicode is mostly a superset of Windows-1252) to [Text.Encoding]::GetEncoding(850).GetString([byte[]] 0xE9) (-> Ú, whose Unicode code point is 0xDA).

[3] This applies when its stdout / stderr streams are connected to something other than a console, such as when their output is captured by PowerShell.
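
A way to verify this from PowerShell (assuming python is on your PATH) is to ask Python which encoding its stdout uses:

    # When PowerShell captures the output, Python's stdout is a pipe, so Python
    # falls back to the ANSI code page (e.g. cp1252):
    $enc = python -c "import sys; print(sys.stdout.encoding)"
    $enc
    # Run the same python command directly at an interactive console instead, and
    # Python 3.6+ reports utf-8, because it uses UTF-8 for console I/O there.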

mklement0