How to unescape multibyte unicode in c#

Question

The following Unicode string from a text file encodes a single apostrophe using 3 bytes, each written as its own \u escape:

It\u00e2\u0080\u0099s working

This should decode to:

It’s working

How can I decode this string in C#?

For example, when I try the following code:

string test = @"It\u00e2\u0080\u0099s working";
string test2 = System.Text.RegularExpressions.Regex.Unescape(test);

it incorrectly decodes the first byte only:

Itâ\u0080\u0099s working

novice in DotNet · Answer 1 · 2021-01-18T13:30:38.917

1

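These are the UTF-8 bytes of the apostrophe, each written as a separate \u escape. Convert the string to bytes with Latin-1 (code page 28591), then decode those bytes as UTF-8: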

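using System.Text;

// Regular (non-verbatim) literal: the compiler turns each \u00XX into one char.
string test = "It\u00e2\u0080\u0099s working";
// ISO-8859-1 (code page 28591) maps U+0000..U+00FF to the same byte values, giving E2 80 99.
byte[] bytes = Encoding.GetEncoding(28591).GetBytes(test);
// Decoding those bytes as UTF-8 yields U+2019.
var converted = Encoding.UTF8.GetString(bytes); // It’s working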

edited Jan 18 '21 at 13:30

answered Jan 18 '21 at 13:21

novice in DotNet


Thanks for your answer but this doesn't seem to change anything. The value of "converted" is: It\\u00e2\\u0080\\u0099s working – Stack Man Jan 18 '21 at 13:25
Works if you remove the `@` from the string literal and make it a "normal" string. – phuzi Jan 18 '21 at 13:25
OK - that works now - thanks. However, if I am reading this string from a file, how do I convert it from a literal into a normal string? – Stack Man Jan 18 '21 at 13:31
@StackMan string[] linesRead=System.IO.File.ReadAllLines(@"E:\input.txt",Encoding.GetEncoding(28591)); – novice in DotNet Jan 18 '21 at 13:40
That doesn't work unfortunately. Same problem. To simplify (we can forget the reading from file part), if we put the @ back at the beginning of the string: string test = @"It\u00e2\u0080\u0099s working"; how do we convert that literal string into the correct result: //It’s working – Stack Man Jan 18 '21 at 14:39

@StackMan string test = @"It\u00e2\u0080\u0099s working"; string unescaped=Regex.Unescape (test); byte[] bytes = Encoding.GetEncoding(28591) .GetBytes(unescaped); var converted = Encoding.UTF8.GetString(bytes);//it's working – novice in DotNet Jan 18 '21 at 16:47
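
The approach worked out in the comments above can be collected into one helper. This is only a minimal sketch, assuming .NET 5 or later (for top-level statements). The name FixEscapedUtf8 is made up for illustration, and it only works when every escaped value is in the range 00 to FF, as in the question's string. Regex.Unescape also interprets other backslash escapes, so input containing other backslashes may not survive unchanged.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: turns text such as It\u00e2\u0080\u0099s into It’s.
// Assumes every \uXXXX escape in the input stands for one UTF-8 byte (00..FF).
static string FixEscapedUtf8(string escaped)
{
    // Step 1: turn the literal backslash escapes into chars (U+00E2, U+0080, U+0099).
    string unescaped = Regex.Unescape(escaped);
    // Step 2: ISO-8859-1 (code page 28591) maps U+0000..U+00FF to identical byte values.
    byte[] rawBytes = Encoding.GetEncoding(28591).GetBytes(unescaped);
    // Step 3: reinterpret those bytes as UTF-8.
    return Encoding.UTF8.GetString(rawBytes);
}

string fixedText = FixEscapedUtf8(@"It\u00e2\u0080\u0099s working");
Console.WriteLine(fixedText == "It\u2019s working"); // True
```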

PhazorP · Answer 2 · 2021-01-18T17:21:12.253

That is JavaScript-style Unicode escaping. Use a JSON deserializer in C# to convert it.

(I don't have enough reputation to comment, so I will write here)

Where did you get those characters from in the first place?

\uXXXX is an escape syntax used by JavaScript and C# (I didn't know about C# until now) to write 16-bit Unicode code units in string literals. 16 bits is 4 hex digits, hence \uXXXX, with each X representing one hexadecimal digit.

Note that this is used to write string literals in source code! It is not how bytes are stored in files or in memory. It is an older style: modern source code editors usually support UTF-8, UTF-16, or some other encoding that can store Unicode characters in source files, so they can display the character itself and let you type it directly in the editor. Typing \uXXXX is therefore rarely needed and is going out of style.
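
To illustrate the point about source-code literals, here is a small sketch (assuming C# 9 top-level statements; the variable names are just for illustration). The compiler decodes \uXXXX in a regular literal, while a verbatim @"..." literal keeps the backslash and the digits as ordinary characters, which is also what you get when the text is read from a file.

```csharp
using System;

string regular = "\u0041";   // the compiler decodes the escape: one char, 'A'
string verbatim = @"\u0041"; // verbatim literal: six chars, backslash kept

Console.WriteLine(regular.Length);  // 1
Console.WriteLine(verbatim.Length); // 6
Console.WriteLine(regular == "A");  // True
```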

That is why I asked where you got the string initially. You wrote in one comment that you read it from a file. What generated the file?

If each \uXXXX is taken by itself as a Unicode character, which is what \uXXXX means, these escapes don't make sense here. 00e2 is the letter a with a circumflex (â), and 0080 and 0099 are non-printable control characters.
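
A quick way to check the per-character reading is shown below. It is a sketch only, assuming C# 9 top-level statements, and the variable name is illustrative.

```csharp
using System;

// Each escape read on its own, as the compiler reads it in a regular literal.
string perChar = "\u00e2\u0080\u0099";

foreach (char c in perChar)
{
    Console.WriteLine($"U+{(int)c:X4} control={char.IsControl(c)}");
}
// U+00E2 control=False
// U+0080 control=True
// U+0099 control=True
```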

If E2 80 99 are instead taken together as three single bytes, i.e. dropping the 00 high byte of each \u00XX escape, they form the UTF-8 representation of the Unicode character with hexadecimal value 2019, which is "RIGHT SINGLE QUOTATION MARK" (U+2019). That is the character you are looking for, but whatever generated the string does not seem to be using the escapes correctly. If you end up with such strings and have to process them, the code in the comments above by "novice in DotNet" works, but it may not work in every case.
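
The byte interpretation described above can be checked in a few lines. This is a sketch assuming C# 9 top-level statements; the variable names are illustrative.

```csharp
using System;
using System.Text;

// The low bytes of \u00e2 \u0080 \u0099, taken together as one UTF-8 sequence.
byte[] utf8Bytes = { 0xE2, 0x80, 0x99 };
string decoded = Encoding.UTF8.GetString(utf8Bytes);

// One UTF-16 code unit comes back: U+2019 (hexadecimal).
Console.WriteLine(decoded.Length);                   // 1
Console.WriteLine(((int)decoded[0]).ToString("X4")); // 2019

// The reverse direction: U+2019 encodes to the same three bytes.
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u2019"))); // E2-80-99
```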

You could convert string literals that use \uXXXX escapes with a JavaScript evaluator, or with CSharpScript.RunAsync(), by building a string literal from them, assigning it to a variable, and then looking at its bytes. I tried that afterwards, and because those byte values don't make sense as characters, I don't get anything meaningful from them. I get an a with a circumflex, and CSharpScript refuses to decode the next two and leaves them as is, because they are control characters when decoded.

Here are three different ways of decoding \uXXXX using available C# libraries. The first two use the Newtonsoft.Json package and the last uses Roslyn/CSharpScript; both are available from NuGet. Note that none of these prints a single apostrophe, for the reason described above. In contrast, if I change the string to "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C!", it prints this Japanese text in the debug output window: "こんにちは世界!", which Google Translate tells me is the Japanese translation of "Hello World!"

https://translate.google.com/?sl=ja&tl=en&text=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF%E4%B8%96%E7%95%8C!&op=translate

So in summary, whatever generated those strings doesn't seem to be doing standard things.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
using Newtonsoft.Json;

string test = @"It\u00e2\u0080\u0099s working";

// 1. JSON deserialization, since \uXXXX is a valid escape in JavaScript/JSON string literals.
// Wrap the text in quotes to make it a JSON string literal, then deserialize it as a string.
var d = JsonConvert.DeserializeObject("\"" + test + "\"", typeof(string));
Console.WriteLine(d);
System.Diagnostics.Debug.WriteLine(d);

// 2. The same through a reader. This may be better if you are reading from a stream such as a file.
TextReader reader = new StringReader("\"" + test + "\"");
JsonTextReader rdr = new JsonTextReader(reader);
rdr.Read();
Console.WriteLine(rdr.Value);
System.Diagnostics.Debug.WriteLine(rdr.Value);

// 3. Overkill and too heavy: Roslyn CSharpScript, letting the C# compiler decode the \uXXXX escapes in a string literal.
ScriptOptions opt = ScriptOptions.Default;
//opt = opt.WithFileEncoding(Encoding.Unicode);
// Task.Run keeps the blocking wait off the caller's synchronization context.
ScriptState<string> s = Task.Run(() => CSharpScript.RunAsync<string>("string str = \"" + test + "\".ToString();", opt)).Result;
var ddd = s.Variables[0]; // the script's only variable, str
Console.WriteLine(ddd.Value);
System.Diagnostics.Debug.WriteLine(ddd.Value);

score 0 · Answer 3 · answered Jan 18 '21 at 13:19

Try this to parse the file:

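using System.Globalization;
using System.Text.RegularExpressions;

// Matches \u followed by exactly four hexadecimal digits.
private static readonly Regex _regex = new Regex(@"\\u(?<Value>[0-9a-fA-F]{4})", RegexOptions.Compiled);

// Replaces each \uXXXX escape with the UTF-16 code unit it names.
public string decodeString(string value)
{
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}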
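
Replacing each escape on its own, as above, gives the same â plus two control characters that Regex.Unescape gives for the question's string. A possible variant, sketched here under assumptions, treats each run of consecutive \u00XX escapes as raw bytes and decodes the run as UTF-8. The name DecodeByteEscapes is made up for illustration, the code assumes C# 9 top-level statements, and it assumes every \u00XX escape in the input really is a UTF-8 byte. A genuine Latin-1 escape such as a lone \u00e9 would come out as a replacement character.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: decodes each run of \u00XX escapes as one UTF-8 byte sequence.
static string DecodeByteEscapes(string value)
{
    return Regex.Replace(value, @"(?:\\u00(?<Byte>[0-9a-fA-F]{2}))+", m =>
    {
        // Every repetition of the group is kept in Captures, in order.
        byte[] runBytes = m.Groups["Byte"].Captures
            .Cast<Capture>()
            .Select(c => byte.Parse(c.Value, NumberStyles.HexNumber))
            .ToArray();
        return Encoding.UTF8.GetString(runBytes);
    });
}

string result = DecodeByteEscapes(@"It\u00e2\u0080\u0099s working");
Console.WriteLine(result == "It\u2019s working"); // True
```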
