
I have a string that is Windows-1252 encoded. How do I convert it to UTF-8? I tried Encoding.Convert, but I still get the same 1252-encoded string when I print it:

var destEncoding = Encoding.UTF8; // utf-8
var srcEncoding = Encoding.GetEncoding(1252); 
// convert the source bytes to the destination bytes
var destBytes = Encoding.Convert(srcEncoding, destEncoding, srcEncoding.GetBytes(srcString));

// process the byte[]
//File.WriteAllBytes("myFile", destBytes); // write it to a file OR ...
var destString = destEncoding.GetString(destBytes); // ... get the string
rac10
  • check this way https://stackoverflow.com/questions/1922199/c-sharp-convert-string-from-utf-8-to-iso-8859-1-latin1-h – Youness Abbassi Mar 18 '21 at 20:40
  • For the second time: **this has nothing to do with the string's encoding** (which is always UTF-16, it isn't an option) - what you mean is: the value of the string is a % url-encoded value, and *those* tokens are in an unexpected encoding. For that, you'll need to parse the %-encoded values (giving you bytes), decode them in one text encoding (giving you a string), encode them in a different text encoding (giving you bytes again), and finally re-apply %-encoding to those bytes (giving you a string); quite a lot of pieces! – Marc Gravell Mar 18 '21 at 20:43
  • `HttpUtility.UrlEncode(HttpUtility.UrlDecode(srcString,srcEncoding),destEncoding);`? – JosefZ Mar 18 '21 at 22:08
  • @YounessAbbassi ``` string Message=" C:/Users/%DCser"; Encoding iso = Encoding.GetEncoding("ISO-8859-9"); Encoding utf8 = Encoding.UTF8; byte[] utfBytes = iso.GetBytes(Message); byte[] isoBytes = Encoding.Convert(iso, utf8, utfBytes); string msg = utf8.GetString(isoBytes); Console.WriteLine(msg);``` this didn't work – rac10 Mar 19 '21 at 03:08
  • @MarcGravell Can you give example? – rac10 Mar 19 '21 at 03:11
  • @rac10 I'm not going to try to write it, but I would probably use regex to find any %-encoded portions, presumably `(\%[0-9A-Fa-f]{2})+` (untested), then pull out and parse every 2nd/3rd character, so for a match of length N (some multiple of 3), I get N/3 bytes, then use `encodingSource.GetString`, `encodingTarget.GetBytes`, then manually re-apply %-encoding, and return that from the regex replace callback, and: hopefully, done! If I was doing it at very high volumes, I'd try to use the buffer-based encoding calls, but for light usage: keep it simple (`string` and `byte[]`) – Marc Gravell Mar 19 '21 at 07:38
  • @MarcGravell I think I got the problem,I see URI object is typecasted to Object ,then I use Object.ToString() which might be converting it to UTF16 hence it might be encoding as %DC ,so is there method to convert from UTF16 to UTF8? – rac10 Mar 19 '21 at 13:46
  • @rac10 no, that is completely unrelated; for the **third** time - forget about the string encoding - that is not the problem; the problem is the *contents* of the string, which are the same any which way; as I said, you need to do the steps above! adding as an answer – Marc Gravell Mar 19 '21 at 15:48

2 Answers


Code page 1252 is an 8-bit encoding. The visible escaping (%DC) looks more like URL (percent) encoding; see RFC 3986. You can decode it like this:

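    using System;
    using System.Text;
    using System.Web;

    string inputString = "C:/Users/%DCser";
    // Decode the %-escapes, interpreting the resulting bytes as code page 1252.
    string decoded = HttpUtility.UrlDecode(inputString, Encoding.GetEncoding(1252));
    Console.WriteLine(decoded);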

The code above should output "C:/Users/Üser" without quotes. The string in this example is held as UTF-16, since that is how .NET always stores strings in memory. From here you can encode it to your destination encoding.
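
As a minimal sketch of that last step, the decoded string can be turned into UTF-8 bytes with `Encoding.UTF8.GetBytes`. This assumes .NET Core 3.0 or later, where code page 1252 is only available after registering `CodePagesEncodingProvider`; on .NET Framework the code page is built in and the registration line can be dropped. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;
using System.Web;

static class DecodeThenEncodeSketch
{
    // Hypothetical helper: %-decode as code page 1252, then encode the text as UTF-8.
    public static byte[] DecodeToUtf8Bytes(string percentEncoded)
    {
        // Assumption: running on .NET Core / .NET 5+, where 1252 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string decoded = HttpUtility.UrlDecode(percentEncoded, Encoding.GetEncoding(1252));
        return Encoding.UTF8.GetBytes(decoded);
    }

    static void Main()
    {
        byte[] utf8Bytes = DecodeToUtf8Bytes("C:/Users/%DCser");
        // The single 1252 byte 0xDC (Ü) becomes the two UTF-8 bytes C3 9C.
        Console.WriteLine(BitConverter.ToString(utf8Bytes));
    }
}
```

Note that this produces raw UTF-8 bytes, not a %-encoded string; if the result has to remain a URL-style string, those bytes still need to be %-encoded again.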

Marco

As I tried to explain in a comment, the real problem here is that you have a %-encoded string value, but the %-encoded bytes use a different text encoding from the one you expected. To fix this, you need to:

  1. identify the %-encoded tokens in the source data
  2. parse out the bytes from the source %-encoded blocks
  3. decode those bytes using the source encoding
  4. re-encode the decoded text using the destination encoding
  5. re-apply %-encoding to the resulting bytes
  6. substitute those values back into the original string

For example (which changes "C:/Users/%C5%92ser" to "C:/Users/%8Cser"):

using System;
using System.Text;
using System.Text.RegularExpressions;

static class P
{
    static void Main()
    {
        var result = RewriteUrlPercentEncoding("C:/Users/%C5%92ser",
            Encoding.UTF8, Encoding.GetEncoding(1252));
        Console.WriteLine(result);
    }

    static string RewriteUrlPercentEncoding(string value, Encoding from, Encoding to)
        => Regex.Replace(value, @"(\%[0-9a-fA-F]{2})+", match => // #1
        {
            var s = match.Value;
            // #2
            var bytes = new byte[s.Length / 3];
            for (int i = 0; i < bytes.Length; i++)
            {
                byte hi = ParseNibble(s[(i * 3) + 1]),
                    lo = ParseNibble(s[(i * 3) + 2]);
                bytes[i] = (byte)((hi << 4) | lo);
            }
            // #3 and #4
            var reencoded = to.GetBytes(from.GetString(bytes));
            // #5
            var chars = new char[3 * reencoded.Length];
            int index = 0;
            for (int i = 0; i < reencoded.Length; i++)
            {
                var b = reencoded[i];
                chars[index++] = '%';
                chars[index++] = WriteNibble((byte)(b >> 4));
                chars[index++] = WriteNibble((byte)(b & 0b1111));
            }
            // #6
            return new string(chars);

            static byte ParseNibble(char c) => c switch
            {
                '0' => 0x0,
                '1' => 0x1,
                '2' => 0x2,
                '3' => 0x3,
                '4' => 0x4,
                '5' => 0x5,
                '6' => 0x6,
                '7' => 0x7,
                '8' => 0x8,
                '9' => 0x9,
                'A' => 0xA,
                'B' => 0xB,
                'C' => 0xC,
                'D' => 0xD,
                'E' => 0xE,
                'F' => 0xF,
                'a' => 0xA,
                'b' => 0xB,
                'c' => 0xC,
                'd' => 0xD,
                'e' => 0xE,
                'f' => 0xF,
                _ => throw new ArgumentOutOfRangeException(nameof(c)),
            };
            static char WriteNibble(byte b) => b switch
            {
                0x0 => '0',
                0x1 => '1',
                0x2 => '2',
                0x3 => '3',
                0x4 => '4',
                0x5 => '5',
                0x6 => '6',
                0x7 => '7',
                0x8 => '8',
                0x9 => '9',
                0xA => 'A',
                0xB => 'B',
                0xC => 'C',
                0xD => 'D',
                0xE => 'E',
                0xF => 'F',
                _ => throw new ArgumentOutOfRangeException(nameof(b)),
            };
        });
}

Note that the above is intended for simplicity rather than efficiency; for high volume work, there are many ways to improve this.

Similarly, reversing the encodings allows us to get from things like "C:/Users/%DCser" to "C:/Users/%C3%9Cser":

var result = RewriteUrlPercentEncoding("C:/Users/%DCser",
    Encoding.GetEncoding(1252), Encoding.UTF8);
Console.WriteLine(result);
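
For light usage, the same six steps can probably be written more compactly by letting the framework handle the hex digits. The following is only a sketch under that assumption, not a tuned implementation: `Convert.ToByte(text, 16)` parses each two-digit hex pair and `ToString("X2")` writes it back. The names `CompactRewriteSketch` and `Rewrite` are hypothetical, and the `RegisterProvider` call is only needed on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CompactRewriteSketch
{
    public static string Rewrite(string value, Encoding from, Encoding to)
        => Regex.Replace(value, "(%[0-9a-fA-F]{2})+", match => // step 1: find %-encoded runs
        {
            string s = match.Value;
            // step 2: every 3 characters ("%XX") yield one byte
            var bytes = new byte[s.Length / 3];
            for (int i = 0; i < bytes.Length; i++)
                bytes[i] = Convert.ToByte(s.Substring(i * 3 + 1, 2), 16);

            // steps 3 to 5: decode, re-encode, and write each byte back as %XX
            var sb = new StringBuilder();
            foreach (byte b in to.GetBytes(from.GetString(bytes)))
                sb.Append('%').Append(b.ToString("X2"));

            // step 6: the returned text replaces the matched run
            return sb.ToString();
        });

    static void Main()
    {
        // Assumption: .NET Core / .NET 5+, where code page 1252 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        Console.WriteLine(Rewrite("C:/Users/%DCser", cp1252, Encoding.UTF8));    // C:/Users/%C3%9Cser
        Console.WriteLine(Rewrite("C:/Users/%C5%92ser", Encoding.UTF8, cp1252)); // C:/Users/%8Cser
    }
}
```

This trades a few extra short-lived string allocations per match for less code, which seems reasonable when volume is low.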
Marc Gravell