How to convert string from one encoding to another

Question

I have a string that is Windows-1252 encoded. How do I convert it to UTF-8? I tried Encoding.Convert, but I still get the same 1252-encoded string when I print it.

var destEncoding = Encoding.UTF8; // utf-8
var srcEncoding = Encoding.GetEncoding(1252); 
// convert the source bytes to the destination bytes
var destBytes = Encoding.Convert(srcEncoding, destEncoding, srcEncoding.GetBytes(srcString));

// process the byte[]
//File.WriteAllBytes("myFile", destBytes); // write it to a file OR ...
var destString = destEncoding.GetString(destBytes); // ... get the string

check this way https://stackoverflow.com/questions/1922199/c-sharp-convert-string-from-utf-8-to-iso-8859-1-latin1-h — Youness Abbassi, Mar 18 '21 at 20:40
For the second time: **this has nothing to do with the string's encoding** (which is always UTF-16, it isn't an option) - what you mean is: the value of the string is a % url-encoded value, and *those* tokens are in an unexpected encoding. For that, you'll need to parse the %-encoded values (giving you bytes), decode them in one text encoding (giving you a string), encode them in a different text encoding (giving you bytes again), and finally re-apply %-encoding to those bytes (giving you a string); quite a lot of pieces! — Marc Gravell, Mar 18 '21 at 20:43
`HttpUtility.UrlEncode(HttpUtility.UrlDecode(srcString,srcEncoding),destEncoding);`? — JosefZ, Mar 18 '21 at 22:08
@YounessAbbassi ``` string Message=" C:/Users/%DCser"; Encoding iso = Encoding.GetEncoding("ISO-8859-9"); Encoding utf8 = Encoding.UTF8; byte[] utfBytes = iso.GetBytes(Message); byte[] isoBytes = Encoding.Convert(iso, utf8, utfBytes); string msg = utf8.GetString(isoBytes); Console.WriteLine(msg);``` this didn't work — rac10, Mar 19 '21 at 03:08
@rac10 I'm not going to try to write it, but I would probably use regex to find any %-encoded portions, presumably `(\%[0-9A-Fa-f]{2})+` (untested), then pull out and parse every 2nd/3rd character, so for a match of length N (some multiple of 3), I get (N*2)/3 hex digits, i.e. N/3 bytes, then use `encodingSource.GetString`, `encodingTarget.GetBytes`, then manually re-apply %-encoding, and return that from the regex replace callback, and: hopefully, done! If I was doing it at very high volumes, I'd try to use the buffer-based encoding calls, but for light usage: keep it simple (`string` and `byte[]`) — Marc Gravell, Mar 19 '21 at 07:38
@MarcGravell I think I found the problem: the URI object is cast to Object, and then I call Object.ToString(), which might be converting it to UTF-16 and so encoding it as %DC. Is there a method to convert from UTF-16 to UTF-8? — rac10, Mar 19 '21 at 13:46
@rac10 no, that is completely unrelated; for the **third** time - forget about the string encoding - that is not the problem; the problem is the *contents* of the string, which are the same any which way; as I said, you need to do the steps above! adding as an answer — Marc Gravell, Mar 19 '21 at 15:48
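To make the point from the comments above concrete, here is a minimal sketch showing that round-tripping the string through `Encoding.Convert` hands back exactly the same string, so the `%DC` token is untouched. It assumes .NET Core or .NET 5+, where code page 1252 has to be registered through `CodePagesEncodingProvider` before use; the names `RoundTripDemo` and `RoundTrip` are made up for illustration.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // Hypothetical helper: string -> 1252 bytes -> UTF-8 bytes -> string.
    public static string RoundTrip(string src)
    {
        // Assumption: .NET Core / .NET 5+, where 1252 is unavailable until registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        byte[] utf8Bytes = Encoding.Convert(cp1252, Encoding.UTF8, cp1252.GetBytes(src));
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    static void Main()
    {
        string src = "C:/Users/%DCser";
        // '%', 'D' and 'C' are plain ASCII characters, so nothing changes.
        Console.WriteLine(RoundTrip(src) == src); // True
    }
}
```

The conversion only changes how the characters are represented as bytes; it never looks inside the `%DC` escape, which is why the printed value is identical.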

score 2 · Accepted Answer · edited Oct 07 '21 at 11:02

Code page 1252 is an 8-bit encoding. The visible escaping (%DC) looks more like URL (percent) encoding; see RFC 3986. You can decode it like this:

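    using System;
    using System.Text;
    using System.Web;

    // On .NET Core / .NET 5+, register code pages first:
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    string inputString = "C:/Users/%DCser";
    // Decode the %-escapes, interpreting the bytes as Windows-1252
    string decoded = HttpUtility.UrlDecode(inputString, Encoding.GetEncoding(1252));
    Console.WriteLine(decoded);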

The code above should output "C:/Users/Üser" (without the quotes). The decoded string is held in memory as UTF-16, as every .NET string is, so from here you can encode it to whatever destination encoding you need.
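As a sketch of that last step, the decoded string can be turned into UTF-8 bytes with `Encoding.UTF8.GetBytes`. This assumes .NET Core or .NET 5+ (hence the `CodePagesEncodingProvider` registration); the names `DecodeThenEncodeDemo` and `DecodeToUtf8Bytes` are hypothetical.

```csharp
using System;
using System.Text;
using System.Web;

static class DecodeThenEncodeDemo
{
    // Hypothetical helper: decode %-escapes as Windows-1252, then encode as UTF-8.
    public static byte[] DecodeToUtf8Bytes(string percentEncoded)
    {
        // Assumption: .NET Core / .NET 5+, where 1252 is unavailable until registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string decoded = HttpUtility.UrlDecode(percentEncoded, Encoding.GetEncoding(1252));
        return Encoding.UTF8.GetBytes(decoded);
    }

    static void Main()
    {
        byte[] utf8 = DecodeToUtf8Bytes("C:/Users/%DCser");
        // The single 1252 byte DC (Ü) becomes the two UTF-8 bytes C3 9C.
        Console.WriteLine(BitConverter.ToString(utf8));
        // prints 43-3A-2F-55-73-65-72-73-2F-C3-9C-73-65-72
    }
}
```

Keep in mind that `HttpUtility.UrlDecode` also turns `+` into a space, which may matter if the input is a file path rather than a query string.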

Marco · answered Mar 18 '21 at 20:38 · edited Oct 07 '21 at 11:02 (Community)

This is giving output C:/Users/?ser – rac10 Mar 19 '21 at 06:05
I ran exactly the same statements but am getting different output from the above – rac10 Mar 19 '21 at 11:07
@rac10 sorry, my bad, I forgot to add the 1252 encoding. Fixed my answer. – Marco Mar 20 '21 at 10:05
Thanks, will try. Can we also do this using the Uri class? – rac10 Mar 20 '21 at 10:13
Won't using Encoding.GetEncoding(1252) cause issues with Chinese or Japanese text? – rac10 Mar 27 '21 at 04:24

Marc Gravell · Answer 2 · 2021-03-19T15:59:50.150

As I tried to explain in a comment, the real problem here is that you have a %-encoded string value, but using a different encoding to what you expected; to fix this, you need to:

1. identify the %-encoded tokens in the source data
2. parse out the bytes from the source %-encoded blocks
3. decode those bytes into a string using the source encoding
4. re-encode that string into bytes using the destination encoding
5. re-apply %-encoding to those bytes
6. substitute those values back into the original string

For example (which changes "C:/Users/%C5%92ser" to "C:/Users/%8Cser"):

using System;
using System.Text;
using System.Text.RegularExpressions;

static class P
{
    static void Main()
    {
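        // On .NET Core / .NET 5+, code page 1252 must first be registered:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var result = RewriteUrlPercentEncoding("C:/Users/%C5%92ser",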
            Encoding.UTF8, Encoding.GetEncoding(1252));
        Console.WriteLine(result);
    }

    static string RewriteUrlPercentEncoding(string value, Encoding from, Encoding to)
        => Regex.Replace(value, @"(\%[0-9a-fA-F]{2})+", match => // #1
        {
            var s = match.Value;
            // #2
            var bytes = new byte[s.Length / 3];
            for (int i = 0; i < bytes.Length; i++)
            {
                byte hi = ParseNibble(s[(i * 3) + 1]),
                    lo = ParseNibble(s[(i * 3) + 2]);
                bytes[i] = (byte)((hi << 4) | lo);
            }
            // #3 and #4
            var reencoded = to.GetBytes(from.GetString(bytes));
            // #5
            var chars = new char[3 * reencoded.Length];
            int index = 0;
            for (int i = 0; i < reencoded.Length; i++)
            {
                var b = reencoded[i];
                chars[index++] = '%';
                chars[index++] = WriteNibble((byte)(b >> 4));
                chars[index++] = WriteNibble((byte)(b & 0b1111));
            }
            // #6
            return new string(chars);

            static byte ParseNibble(char c) => c switch
            {
                '0' => 0x0,
                '1' => 0x1,
                '2' => 0x2,
                '3' => 0x3,
                '4' => 0x4,
                '5' => 0x5,
                '6' => 0x6,
                '7' => 0x7,
                '8' => 0x8,
                '9' => 0x9,
                'A' => 0xA,
                'B' => 0xB,
                'C' => 0xC,
                'D' => 0xD,
                'E' => 0xE,
                'F' => 0xF,
                'a' => 0xA,
                'b' => 0xB,
                'c' => 0xC,
                'd' => 0xD,
                'e' => 0xE,
                'f' => 0xF,
                _ => throw new ArgumentOutOfRangeException(nameof(c)),
            };
            static char WriteNibble(byte b) => b switch
            {
                0x0 => '0',
                0x1 => '1',
                0x2 => '2',
                0x3 => '3',
                0x4 => '4',
                0x5 => '5',
                0x6 => '6',
                0x7 => '7',
                0x8 => '8',
                0x9 => '9',
                0xA => 'A',
                0xB => 'B',
                0xC => 'C',
                0xD => 'D',
                0xE => 'E',
                0xF => 'F',
                _ => throw new ArgumentOutOfRangeException(nameof(b)),
            };
        });
}

Note that the above is intended for simplicity rather than efficiency; for high volume work, there are many ways to improve this.

Similarly, reversing the encodings allows us to get from things like "C:/Users/%DCser" to "C:/Users/%C3%9Cser":

var result = RewriteUrlPercentEncoding("C:/Users/%DCser",
    Encoding.GetEncoding(1252), Encoding.UTF8);
Console.WriteLine(result);
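For comparison, the same six steps can be written more compactly by letting `Convert.ToByte` and `ToString("X2")` handle the hex digits. This is only a sketch of an alternative, not a replacement for the version above: the class name `CompactRewrite` is made up, it trades a few extra allocations for brevity, and it assumes .NET Core or .NET 5+ for the code page registration.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class CompactRewrite
{
    public static string RewriteUrlPercentEncoding(string value, Encoding from, Encoding to)
        => Regex.Replace(value, "(%[0-9a-fA-F]{2})+", match => // step 1: find %XX runs
        {
            string s = match.Value;

            // step 2: each "%XX" triple yields one byte
            var bytes = new byte[s.Length / 3];
            for (int i = 0; i < bytes.Length; i++)
                bytes[i] = Convert.ToByte(s.Substring(i * 3 + 1, 2), 16);

            // steps 3 and 4: bytes -> string (source encoding) -> bytes (target encoding)
            byte[] reencoded = to.GetBytes(from.GetString(bytes));

            // steps 5 and 6: re-apply %-encoding; Regex.Replace substitutes the result
            return string.Concat(reencoded.Select(b => "%" + b.ToString("X2")));
        });

    static void Main()
    {
        // Assumption: .NET Core / .NET 5+, where 1252 is unavailable until registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Console.WriteLine(RewriteUrlPercentEncoding(
            "C:/Users/%DCser", Encoding.GetEncoding(1252), Encoding.UTF8));
        // prints C:/Users/%C3%9Cser
    }
}
```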
