84

How can I convert this string:

This string contains the Unicode character Pi(π)

into an escaped ASCII string:

This string contains the Unicode character Pi(\u03c0)

and vice versa?

The encodings currently available in C# (for example Encoding.ASCII) convert the π character to "?". I need to preserve that character.

DavidRR
Ali

9 Answers

147

This goes back and forth to and from the \uXXXX format.

using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

class Program {
    static void Main( string[] args ) {
        // \u03c0 is the lowercase Greek letter pi (π)
        string unicodeString = "This function contains a unicode character pi (\u03c0)";

        Console.WriteLine( unicodeString );

        string encoded = EncodeNonAsciiCharacters(unicodeString);
        Console.WriteLine( encoded );

        string decoded = DecodeEncodedNonAsciiCharacters( encoded );
        Console.WriteLine( decoded );
    }

    static string EncodeNonAsciiCharacters( string value ) {
        StringBuilder sb = new StringBuilder();
        foreach( char c in value ) {
            if( c > 127 ) {
                // This character is too big for ASCII
                string encodedValue = "\\u" + ((int) c).ToString( "x4" );
                sb.Append( encodedValue );
            }
            else {
                sb.Append( c );
            }
        }
        return sb.ToString();
    }

    static string DecodeEncodedNonAsciiCharacters( string value ) {
        // Only hex digits may follow \u, so int.Parse cannot throw on a match
        return Regex.Replace(
            value,
            @"\\u(?<Value>[a-fA-F0-9]{4})",
            m => {
                return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
            } );
    }
}

Outputs:

This function contains a unicode character pi (π)

This function contains a unicode character pi (\u03c0)

This function contains a unicode character pi (π)

JNK
Adam Sills
  • DecodeEncodedNonAsciiCharacters will throw FormatException for strings like "\\user" – vovafeldman Sep 24 '12 at 09:45
  • \user shouldn't match because there aren't 4 characters after the u, but I get your point. Just change the regex character matching to [a-fA-F0-9]. It'll still match things it's not intended to match, but it seems like it still matches the original question's intent. – Adam Sills Sep 24 '12 at 16:11
  • Looks nice and clean. Still, I am surprised that there is no System .Net class that will do this. – saarp Nov 29 '12 at 20:15
  • Not really sure why there'd be a System class to do it. Not all languages use escape character sequences (VB.NET, for example, does not). So it would be language specific. You might be able to use Microsoft.CSharp.CSharpCodeProvider to do it, but it seems like overkill. – Adam Sills Feb 15 '13 at 04:29
  • @AdamSills if a third party server is returning them they will need decoding. A static method would be nice within Net or Web class for times when you want to convert those characters. – James Jeffery Oct 10 '13 at 00:05
  • Thanks a lot! Wrapped all this into a small WinForms app: helps me a lot in dealing with Java translation properties... – odalet Feb 10 '15 at 14:48
  • Since a C# `char` is 2 bytes, some Unicode code points span more than one `char`. Since you're handling one `char` at a time, I don't think this will work for surrogate pairs. – Motti Jul 14 '20 at 14:55
27

To unescape, you can simply use one of these built-in functions:

System.Text.RegularExpressions.Regex.Unescape(string)

System.Uri.UnescapeDataString(string)

Regex.Unescape handles \uXXXX escapes. For percent-encoded input (%XX), which is decoded as UTF-8, I suggest using this method:

UnescapeDataString(string)
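
A minimal sketch of what each of the two calls decodes. It assumes the only backslash sequences in the input are \uXXXX escapes, because Regex.Unescape also converts other regex escapes such as \n and \t. The class and method names (UnescapeDemo, FromBackslashU, FromPercent) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnescapeDemo
{
    // Regex.Unescape turns \uXXXX (and other regex escapes) into characters.
    public static string FromBackslashU(string s) => Regex.Unescape(s);

    // Uri.UnescapeDataString turns percent-encoded UTF-8 bytes into characters.
    public static string FromPercent(string s) => Uri.UnescapeDataString(s);

    static void Main()
    {
        Console.WriteLine(FromBackslashU(@"Pi(\u03c0)")); // backslash-u input
        Console.WriteLine(FromPercent("Pi(%CF%80)"));     // percent-encoded input
    }
}
```

Both calls yield the same string here; which one applies depends on how the incoming data was escaped.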
MrRolling
12
string StringFold(string input, Func<char, string> proc)
{
  return string.Concat(input.Select(proc).ToArray());
}

string FoldProc(char input)
{
  if (input >= 128)
  {
    return string.Format(@"\u{0:x4}", (int)input);
  }
  return input.ToString();
}

string EscapeToAscii(string input)
{
  return StringFold(input, FoldProc);
}
leppie
4

As a one-liner:

var result = Regex.Replace(input, @"[^\x00-\x7F]", c => 
    string.Format(@"\u{0:x4}", (int)c.Value[0]));
Douglas
2
class Program
{
    static void Main(string[] args)
    {
        char[] originalString = "This string contains the unicode character Pi(π)".ToCharArray();
        StringBuilder asAscii = new StringBuilder(); // store final ascii string and Unicode points
        foreach (char c in originalString)
        {
            // test if char is ascii, otherwise convert to Unicode Code Point
            int cint = Convert.ToInt32(c);
            if (cint <= 127)
                asAscii.Append(c);
            else
                asAscii.AppendFormat("\\u{0:x4}", cint);
        }
        Console.WriteLine("Final string: {0}", asAscii);
        Console.ReadKey();
    }
}

Each non-ASCII char is converted to a \uXXXX escape of its UTF-16 code unit (which equals the Unicode code point for characters in the Basic Multilingual Plane) and appended to the final string.

jdecuyper
2

Here is my current implementation:

public static class UnicodeStringExtensions
{
    public static string EncodeNonAsciiCharacters(this string value) {
        var bytes = Encoding.Unicode.GetBytes(value);
        var sb = StringBuilderCache.Acquire(value.Length);
        bool encodedsomething = false;
        for (int i = 0; i < bytes.Length; i += 2) {
            var c = BitConverter.ToUInt16(bytes, i);
            if ((c >= 0x20 && c <= 0x7f) || c == 0x0A || c == 0x0D) {
                sb.Append((char) c);
            } else {
                sb.Append($"\\u{c:x4}");
                encodedsomething = true;
            }
        }
        if (!encodedsomething) {
            StringBuilderCache.Release(sb);
            return value;
        }
        return StringBuilderCache.GetStringAndRelease(sb);
    }


    public static string DecodeEncodedNonAsciiCharacters(this string value)
      => Regex.Replace(value,/*language=regexp*/@"(?:\\u[a-fA-F0-9]{4})+", Decode);

    static readonly string[] Splitsequence = new [] { "\\u" };
    private static string Decode(Match m) {
        var bytes = m.Value.Split(Splitsequence, StringSplitOptions.RemoveEmptyEntries)
                .Select(s => ushort.Parse(s, NumberStyles.HexNumber)).SelectMany(BitConverter.GetBytes).ToArray();
        return Encoding.Unicode.GetString(bytes);
    }
}

This passes a test:

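public void TestBigUnicode() {
    var s = "\U00020000";
    var encoded = s.EncodeNonAsciiCharacters();
    var decoded = encoded.DecodeEncodedNonAsciiCharacters();
    // Assert.Equals is inherited from object and is not a test assertion;
    // AreEqual is the assertion in NUnit 3 and MSTest
    Assert.AreEqual(s, decoded);
}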

with the encoded value: "\ud840\udc00"

This implementation makes use of StringBuilderCache, an internal .NET class whose code is available in the reference source.

Bill Barry
1

A small patch to @Adam Sills's answer. It fixes the FormatException in cases where the input string is like "c:\u00ab\otherdirectory\", and it adds RegexOptions.Compiled, which makes repeated matching faster at the cost of a slower one-time regex construction:

    private static Regex DECODING_REGEX = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
    private const string PLACEHOLDER = @"#!#";
    public static string DecodeEncodedNonAsciiCharacters(this string value)
    {
        return DECODING_REGEX.Replace(
            value.Replace(@"\\", PLACEHOLDER),
            m => { 
                return ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString(); })
            .Replace(PLACEHOLDER, @"\\");
    }
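
The placeholder trick above breaks if the input itself happens to contain "#!#". A possible alternative, sketched here under the assumption that a doubled backslash should always be left untouched, matches either an escaped backslash or a \uXXXX escape in a single pass. The names SinglePassDecoder and Decode are hypothetical, not part of the original answer.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class SinglePassDecoder
{
    // Either an escaped backslash (\\) or a \uXXXX escape. The regex scans
    // left to right, so in "\\u00ab" the doubled backslash is consumed first
    // and the following "u00ab" is left alone.
    private static readonly Regex Pattern =
        new Regex(@"\\\\|\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);

    public static string Decode(string value)
    {
        return Pattern.Replace(value, m =>
            m.Groups["Value"].Success
                ? ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
                : m.Value); // escaped backslash: keep unchanged
    }

    static void Main()
    {
        Console.WriteLine(Decode(@"pi (\u03c0)"));    // escape is decoded
        Console.WriteLine(Decode(@"c:\\u00ab\\dir")); // printed unchanged
    }
}
```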
Irshad
vovafeldman
1

To store actual Unicode code points, you first have to convert the String's UTF-16 code units to UTF-32 code units (which are currently the same as the Unicode code points). Use System.Text.Encoding.UTF32.GetBytes() for that, and then write the resulting values to the StringBuilder as needed, e.g.:

static void Main(string[] args)
{
    String originalString = "This string contains the unicode character Pi(π)";
    // Encoding.UTF32 is little-endian; BitConverter matches on little-endian machines
    Byte[] bytes = Encoding.UTF32.GetBytes(originalString);
    StringBuilder asAscii = new StringBuilder();
    for (int idx = 0; idx < bytes.Length; idx += 4)
    {
        uint codepoint = BitConverter.ToUInt32(bytes, idx);
        if (codepoint <= 127)
            asAscii.Append(Convert.ToChar(codepoint));
        else if (codepoint <= 0xFFFF)
            asAscii.AppendFormat("\\u{0:x4}", codepoint);
        else
            // \uXXXX holds only four hex digits, so use the eight-digit \U form
            asAscii.AppendFormat("\\U{0:x8}", codepoint);
    }
    Console.WriteLine("Final string: {0}", asAscii);
    Console.ReadKey();
}
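
The same idea can be sketched without the intermediate byte array by walking the string and combining surrogate pairs with char.IsSurrogatePair and char.ConvertToUtf32. This is an illustrative variant, not the answer's own code: the names CodePointEscaper and Escape are hypothetical, and writing code points above U+FFFF in the eight-digit \UXXXXXXXX form (borrowed from C# literal syntax) is an assumption about the desired output format.

```csharp
using System;
using System.Text;

static class CodePointEscaper
{
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        for (int i = 0; i < value.Length; i++)
        {
            // Start with the UTF-16 code unit; an unpaired surrogate is
            // escaped on its own as a plain code unit.
            int codepoint = value[i];
            if (char.IsSurrogatePair(value, i))
            {
                codepoint = char.ConvertToUtf32(value, i);
                i++; // skip the low surrogate
            }

            if (codepoint <= 127)
                sb.Append((char)codepoint);
            else if (codepoint <= 0xFFFF)
                sb.AppendFormat("\\u{0:x4}", codepoint);
            else
                sb.AppendFormat("\\U{0:x8}", codepoint); // assumed format
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Escape("Pi(\u03c0)"));
        Console.WriteLine(Escape("\U00020000"));
    }
}
```

A decoder for this output would need to recognize the \U form as well as \u.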
Brijesh Bhatt
Remy Lebeau
0

You need to use the Convert() method in the Encoding class:

  • Create an Encoding object that represents ASCII encoding
  • Create an Encoding object that represents Unicode encoding
  • Call Encoding.Convert() with the source encoding, the destination encoding, and the byte array to be converted

There is an example here:

using System;
using System.Text;

namespace ConvertExample
{
   class ConvertExampleClass
   {
      static void Main()
      {
         string unicodeString = "This string contains the unicode character Pi(\u03a0)";

         // Create two different encodings.
         Encoding ascii = Encoding.ASCII;
         Encoding unicode = Encoding.Unicode;

         // Convert the string into a byte[].
         byte[] unicodeBytes = unicode.GetBytes(unicodeString);

         // Perform the conversion from one encoding to the other.
         byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);

         // Convert the new byte[] into a char[] and then into a string.
         // This is a slightly different approach to converting to illustrate
         // the use of GetCharCount/GetChars.
         char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
         ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
         string asciiString = new string(asciiChars);

         // Display the strings created before and after the conversion.
         Console.WriteLine("Original string: {0}", unicodeString);
         Console.WriteLine("Ascii converted string: {0}", asciiString);
      }
   }
}
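
Note that Encoding.ASCII replaces every character it cannot represent with "?", so a conversion like the one above cannot produce \uXXXX escapes by itself. The sketch below shows that replacement and one way to detect a lossy conversion, using an ASCII encoding created with EncoderExceptionFallback. The names AsciiCheck and IsPureAscii are made up for illustration.

```csharp
using System;
using System.Text;

static class AsciiCheck
{
    // ASCII encoding that throws instead of silently substituting "?".
    private static readonly Encoding StrictAscii = Encoding.GetEncoding(
        "us-ascii", new EncoderExceptionFallback(), new DecoderExceptionFallback());

    public static bool IsPureAscii(string value)
    {
        try
        {
            StrictAscii.GetBytes(value);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false; // at least one character is outside ASCII
        }
    }

    static void Main()
    {
        string s = "Pi(\u03c0)";
        // The default ASCII encoding replaces the pi character with "?"
        Console.WriteLine(Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s)));
        Console.WriteLine(IsPureAscii(s));
    }
}
```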
Irshad
JeffFerguson
  • I tried this already. The issue with it is that it converts the unicode character π (\u03a0) into "?". I need it to convert it to "\u03a0". – Ali Oct 23 '09 at 20:34