I have to read a badly encoded string from a remote service and cannot figure out how to recover the correct value in C# or JavaScript. I can neither change the values in the service nor change the way they are saved in the DB, but I need to display them correctly.

Bad string: AdriÃ¡n JosÃ©
Correct string: Adrián José

The error is reversible: tools such as https://www.iosart.com/tools/charset-fixer recover the correct value, and so does Notepad++ when the Encoding is changed from ANSI to UTF-8.

So far I have this solution in JS (client side), but I don't like using the escape() function and would prefer to do the fix on the server side.

var badString = "AdriÃ¡n JosÃ©";
var fixedString = decodeURIComponent(escape(badString)); // "Adrián José"

I tried to play with the Encoding class in C# (like here), but couldn't find a valid combination.

var badString = "AdriÃ¡n JosÃ©";
var origEnco = Encoding.UTF8;
var targetEnco = Encoding.Default;
byte[] utfBytes = origEnco.GetBytes(badString);
byte[] isoBytes = Encoding.Convert(origEnco, targetEnco, utfBytes);
string fixedString = targetEnco.GetString(isoBytes); // "AdriÃ¡n JosÃ©"

What am I missing? How do the character set fixer or Notepad++ work?

Thomas Dickey
adrianjgp
  • Do you know the wrong encoding that was used to produce the string you are getting? And do you know the *correct* encoding that should be used to decode the string? – Sweeper Apr 27 '23 at 01:28
  • How are you obtaining the string in the first place? Could it be that that's where the problem lies? One scenario I can imagine: using HttpClient's `.Content.ReadAsStringAsync()` when the content isn't UTF8-encoded. – ProgrammingLlama Apr 27 '23 at 01:34
  • You may need to loop through all possible *pairs* of encodings to figure out how to reconstruct the string. Take a look at the first part of [this answer](https://stackoverflow.com/a/39510190/3744182) to [filter invalid values in json string](https://stackoverflow.com/q/39509138/3744182) for an example. – dbc Apr 27 '23 at 01:40
  • Thank you all for answering. I don't know the actual encoding of the string in the DB (inside a JSON). I also tried the suggested loop before you provided the final solution. – adrianjgp Apr 27 '23 at 02:31

1 Answer

For the example you provided, this code works and outputs "Adrián José" as expected:

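// The bad text looks like UTF-8 bytes that were decoded as Windows-1252
var currentEncoding = Encoding.GetEncoding("Windows-1252");
var targetEncoding = Encoding.UTF8;
string input = "AdriÃ¡n JosÃ©";
// Re-encode to get the original bytes back, then decode them as UTF-8
string output = targetEncoding.GetString(currentEncoding.GetBytes(input));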
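
Since the question also asks about JavaScript, here is a rough sketch of the same round trip without escape(). It assumes every character in the bad string is below U+0100, so each one stands for exactly one original byte. That holds for this example, but Windows-1252 characters such as € sit above that range and would need a real Windows-1252 encoder. The helper names corrupt and repair are made up for illustration.

```javascript
// Sketch only: assumes each character of the bad string is below U+0100,
// i.e. one character per original byte.

// Hypothetical helper: reproduce the corruption by reading UTF-8 bytes
// as if each byte were a separate character.
function corrupt(text) {
  const utf8Bytes = new TextEncoder().encode(text);
  return Array.from(utf8Bytes, (b) => String.fromCharCode(b)).join("");
}

// Hypothetical helper: undo it by turning the characters back into bytes
// and decoding those bytes as UTF-8.
function repair(badText) {
  const bytes = Uint8Array.from(badText, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

const bad = corrupt("Adrián José");
console.log(bad);         // AdriÃ¡n JosÃ©
console.log(repair(bad)); // Adrián José
```

This is also, as far as I can tell, what the Notepad++ trick amounts to: the bytes stay the same and only the encoding used to read them changes.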

If you're using .NET Core/.NET 5+ then you'll need to install System.Text.Encoding.CodePages from NuGet and add this somewhere in your code (I usually do it at the top of my Main method):

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

While this provides the result you're interested in, I don't know if it will work for all instances of your bad text.
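
One way to reduce that risk is to apply the fix only when the text really decodes as UTF-8 and to leave it alone otherwise. The following JavaScript is a minimal sketch of that idea, not a complete solution: repairIfMojibake is a hypothetical name, and it assumes the damaged characters are all below U+0100.

```javascript
// Hypothetical helper: returns the repaired text, or the input unchanged
// when it does not look like UTF-8 that was decoded one byte per character.
function repairIfMojibake(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code > 0xff) return text; // not a single byte: outside this sketch's assumption
    bytes[i] = code;
  }
  try {
    // fatal: true makes decode() throw a TypeError on invalid UTF-8
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return text; // the bytes are not valid UTF-8, so keep the value as it is
  }
}

console.log(repairIfMojibake("AdriÃ¡n JosÃ©")); // Adrián José
console.log(repairIfMojibake("Adrián José"));   // Adrián José (left unchanged)
```

A string that is already correct, like the second call, produces invalid UTF-8 when turned back into bytes, so it passes through untouched.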

If you can, I would fix the problem at the source, rather than trying to fix it once you have the incorrectly-encoded string.

ProgrammingLlama
  • Thank you for your help! I was a little upset at not finding a solution for a relatively simple problem. It seems I didn't need to call the Convert function in this case. – adrianjgp Apr 27 '23 at 01:56