8

We have a text file that contains the following text:

"\u5b89\u5fbd\u5b5f\u5143"

When we read the file's contents in C# .NET, the string shows up as:

"\\u5b89\\u5fbd\\u5b5f\\u5143"

Our decoder method is:

public string Decoder(string value)
{
    Encoding enc = new UTF8Encoding();
    byte[] bytes = enc.GetBytes(value);
    return enc.GetString(bytes);
}

When I pass a hard-coded value,

string Output=Decoder("\u5b89\u5fbd\u5b5f\u5143");

it works well, but when we pass a variable instead, it does not work.

This is what we do with the string we get from the text file:

  value=(text file content)
  string Output=Decoder(value);

It returns the wrong output.

How can I fix this?

Peter Mortensen
PrateekSaluja

6 Answers

17

Use the code below. It unescapes any escaped characters in the input string and returns the result as a new string:

string decoded = Regex.Unescape(value);
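As a minimal sketch of how this behaves, the snippet below uses a verbatim literal as a stand-in for the text read from the file. The variable names are illustrative only, and the code assumes C# 9 or later so that top-level statements compile. Note that Regex.Unescape converts other escapes such as \t as well, which may or may not be what you want.

```csharp
using System;
using System.Text.RegularExpressions;

// Verbatim literal: holds the same characters as the file, i.e. a real
// backslash, the letter 'u', and four hex digits, repeated four times.
string fromFile = @"\u5b89\u5fbd\u5b5f\u5143";

// Regex.Unescape returns a new string; the input is left unchanged.
string decoded = Regex.Unescape(fromFile);

// The result equals the literal that the compiler would have produced.
Console.WriteLine(decoded == "\u5b89\u5fbd\u5b5f\u5143");  // True

// Other escapes are converted too, not only \uXXXX.
Console.WriteLine(Regex.Unescape(@"a\tb") == "a\tb");      // True
```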
Peter Mortensen
Sagar
  • Thank you! I couldn't figure out why my WebClient wasn't properly outputting unicode characters in my string. I didn't even think about the /u being an escape character in the string until I saw your post and it clicked. – CJF Jan 25 '20 at 20:59
8

You could use a regular expression to parse the file:

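// Requires: using System.Globalization; using System.Text.RegularExpressions;
// Matches a backslash, 'u', and exactly four hexadecimal digits.
private static readonly Regex _regex = new Regex(@"\\u(?<Value>[0-9a-fA-F]{4})", RegexOptions.Compiled);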

public string Decoder(string value)
{
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}

And then:

string data = Decoder(File.ReadAllText("test.txt"));
Peter Mortensen
Darin Dimitrov
3

So your file contains the verbatim string

\u5b89\u5fbd\u5b5f\u5143

in ASCII and not the string represented by those four Unicode codepoints in some given encoding?

As it happens, I just wrote some code in C# that can parse strings in this format for a JSON parser project -- here's a variant that only handles \uXXXX escapes:

private static string ReadSlashedString(TextReader reader) {
    var sb = new StringBuilder(32);
    bool q = false;
    while (true) {
        int chrR = reader.Read();

        if (chrR == -1) break;
        var chr = (char) chrR;

        if (!q) {
            if (chr == '\\') {
                q = true;
                continue;
            }
            sb.Append(chr);
        }
        else {
            switch (chr) {
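                case 'u':
                case 'U':
                    // Read exactly four hex digits; fail on truncated input.
                    var hexb = new char[4];
                    if (reader.ReadBlock(hexb, 0, 4) != 4)
                        throw new FormatException("Truncated \\u escape sequence");
                    chr = (char) Convert.ToInt32(new string(hexb), 16);
                    sb.Append(chr);
                    break;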
                default:
                    throw new Exception("Invalid backslash escape (\\ + charcode " + (int) chr + ")");
            }
            q = false;
        }
    }
    return sb.ToString();
}

And you could use it like:

var str = ReadSlashedString(new StringReader("\\u5b89\\u5fbd\\u5b5f\\u5143"));

(or using a StreamReader to read from a file).

Darin Dimitrov's regexp-utilizing answer is probably faster, but I happened to have this code at hand. :)

Peter Mortensen
AKX
0

UTF8Encoding (or any other encoding) won't translate escape sequences like \u5b89 into the corresponding character.

The reason it works when you pass a string constant is that the C# compiler interprets the escape sequences and translates them into the corresponding characters at compile time, long before the decoder is called.

You have to write code that recognizes the escape sequences and converts them into the corresponding characters.
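To illustrate the point, here is a small sketch assuming C# 9 or later (for top-level statements). RoundTrip is a hypothetical helper that repeats what the question's Decoder does, and the verbatim literal stands in for the text read from the file.

```csharp
using System;
using System.Text;

// Same round trip as the Decoder method in the question.
static string RoundTrip(string value)
{
    Encoding enc = new UTF8Encoding();
    return enc.GetString(enc.GetBytes(value));
}

// Regular literal: the compiler turns each \uXXXX into one character.
string literal = "\u5b89\u5fbd\u5b5f\u5143";

// Verbatim literal: backslashes stay as they are, like the file's text.
string fromFile = @"\u5b89\u5fbd\u5b5f\u5143";

// The round trip hands back its input unchanged in both cases.
Console.WriteLine(RoundTrip(literal) == literal);    // True
Console.WriteLine(RoundTrip(fromFile) == fromFile);  // True

// The two inputs were never the same string to begin with.
Console.WriteLine(literal == fromFile);              // False
```

In other words, the hard-coded call only appears to decode anything; the string was already decoded before Decoder ran.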

MiMo
0

When you are reading "\u5b89\u5fbd\u5b5f\u5143" you get exactly what you read. The debugger escapes your strings before displaying them. The double backslashes in the string are actually single backslashes that have been escaped.

When you pass your hard-coded value, you are not actually passing in what you see on the screen. You are passing in four Unicode characters, since the C# string literal is unescaped by the compiler.
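One way to check this without relying on the debugger's display is to look at the string's length and individual characters. The sketch below assumes C# 9 or later (top-level statements) and uses a single escape for brevity; the variable names are illustrative.

```csharp
using System;

// Stand-in for what reading the file returns for the text \u5b89.
string fromFile = @"\u5b89";

// What the compiler produces from the same escape in a regular literal.
string literal = "\u5b89";

Console.WriteLine(fromFile.Length);             // 6
Console.WriteLine(fromFile[0] == '\\');         // True: one backslash, not two
Console.WriteLine(literal.Length);              // 1
Console.WriteLine((int)literal[0] == 0x5b89);   // True
```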

Darin already posted a way to unescape Unicode characters from the file, so I won't repeat it.

Kendall Frey
-2

I think this will give you some idea.

string str = "ivandro\u0020";
str = str.Trim();

If you try to print the string, you will notice that the space, which is \u0020, is removed.

Peter Mortensen
Ivandro Jao