Replace Unicode escape sequences in a string

Question

We have a text file that contains the following text:

"\u5b89\u5fbd\u5b5f\u5143"

When we read the file content in C# (.NET), it is displayed as:

"\\u5b89\\u5fbd\\u5b5f\\u5143"

Our decoder method is:

public string Decoder(string value)
{
    Encoding enc = new UTF8Encoding();
    byte[] bytes = enc.GetBytes(value);
    return enc.GetString(bytes);
}

When I pass a hard-coded value,

string Output=Decoder("\u5b89\u5fbd\u5b5f\u5143");

it works, but when we pass a variable that holds the file content, it does not.

This is how we pass the string that we read from the text file:

  value=(text file content)
  string Output=Decoder(value);

It returns the wrong output.

How can I fix this?

score 17 · Answer 1 · edited Jul 09 '21 at 16:49

Use the code below. Regex.Unescape returns a new string in which the escaped characters of the input string have been converted:

string decoded = Regex.Unescape(value);
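
To make the effect concrete, here is a minimal sketch, assuming the file content has already been read into a string. The class name UnescapeDemo and the helper Decode are hypothetical names used only for illustration; the verbatim literal stands in for the text read from the file. Note that Regex.Unescape is written for regular-expression escapes, so it also transforms other backslash sequences (such as \n or \t), which may or may not be what you want.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    // Hypothetical helper: a thin wrapper around the framework's Regex.Unescape.
    public static string Decode(string raw)
    {
        return Regex.Unescape(raw);
    }

    public static void Main()
    {
        // Verbatim literal: the string really contains backslash, 'u', and hex digits,
        // just like the text read from the file.
        string raw = @"\u5b89\u5fbd\u5b5f\u5143";
        string decoded = Decode(raw);

        Console.WriteLine(raw.Length);      // 24 (four escapes of six characters each)
        Console.WriteLine(decoded.Length);  // 4  (four decoded characters)
        Console.WriteLine(decoded == "\u5b89\u5fbd\u5b5f\u5143"); // True
    }
}
```

If the input can contain backslashes that are not meant as escapes, a narrower approach that only touches \uXXXX sequences is probably safer.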

answered May 14 '14 at 08:50 by Sagar · edited by Peter Mortensen

Thank you! I couldn't figure out why my WebClient wasn't properly outputting unicode characters in my string. I didn't even think about the /u being an escape character in the string until I saw your post and it clicked. – CJF Jan 25 '20 at 20:59

score 8 · Accepted Answer · edited Jul 09 '21 at 16:41

You could use a regular expression to parse the file:

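using System.Globalization;
using System.Text.RegularExpressions;

// Match a backslash, 'u', and exactly four hexadecimal digits.
private static readonly Regex _regex = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);

public string Decoder(string value)
{
    // Replace each escape with the character whose UTF-16 code unit it names.
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}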

And then:

string data = Decoder(File.ReadAllText("test.txt"));

answered Mar 16 '12 at 13:46 by Darin Dimitrov · edited by Peter Mortensen

First of all, thanks for your reply. When I use this code I get compilation errors: 1. Cannot convert lambda expression to type 'string' because it is not a delegate. 2. The name 'NumberStyles' does not exist in the current context. – PrateekSaluja Mar 16 '12 at 14:14
Can you please tell me what mistake I made? – PrateekSaluja Mar 16 '12 at 14:15
Which .NET version are you using? The `NumberStyles` enumeration is defined in the `System.Globalization` namespace so make sure you have referenced it. – Darin Dimitrov Mar 16 '12 at 14:16
Thanks @Darin, I have voted for you. If you can resolve the compilation error, that would be great for us. – PrateekSaluja Mar 16 '12 at 14:20
I am using the 3.5 framework. – PrateekSaluja Mar 16 '12 at 14:21
@PrateekSaluja, OK, then add `using System.Globalization` to the top of your file. – Darin Dimitrov Mar 16 '12 at 14:21
Oh, yes, I am so sorry about that. That works, and even faster. Thank you so much for your time and code. – PrateekSaluja Mar 16 '12 at 14:23

score 3 · Answer 3 · edited Jul 09 '21 at 16:46

So your file contains the verbatim string

\u5b89\u5fbd\u5b5f\u5143

in ASCII and not the string represented by those four Unicode codepoints in some given encoding?

As it happens, I just wrote some code in C# that can parse strings in this format for a JSON parser project -- here's a variant that only handles \uXXXX escapes:

private static string ReadSlashedString(TextReader reader) {
    var sb = new StringBuilder(32);
    bool q = false;
    while (true) {
        int chrR = reader.Read();

        if (chrR == -1) break;
        var chr = (char) chrR;

        if (!q) {
            if (chr == '\\') {
                q = true;
                continue;
            }
            sb.Append(chr);
        }
        else {
            switch (chr) {
                case 'u':
                case 'U':
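                    var hexb = new char[4];
                    // ReadBlock keeps reading until four chars arrive or the input ends.
                    if (reader.ReadBlock(hexb, 0, 4) != 4)
                        throw new Exception("Truncated \\u escape");
                    chr = (char) Convert.ToInt32(new string(hexb), 16);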
                    sb.Append(chr);
                    break;
                default:
                    throw new Exception("Invalid backslash escape (\\ + charcode " + (int) chr + ")");
            }
            q = false;
        }
    }
    return sb.ToString();
}

And you could use it like:

var str = ReadSlashedString(new StringReader("\\u5b89\\u5fbd\\u5b5f\\u5143"));

(or using a StreamReader to read from a file).

Darin Dimitrov's regexp-utilizing answer is probably faster, but I happened to have this code at hand. :)

Thanks, it's working. I tried Darin's code but got a compilation issue. Anyway, thanks a lot for this code. – PrateekSaluja Mar 16 '12 at 14:19
Thank you so much. I struggled with this for the last day. Now I've got it. Thanks again. – PrabhuPrakash Nov 23 '17 at 14:34

score 0 · Answer 4 · answered Mar 16 '12 at 13:44

UTF8Encoding (or any other encoding) won't translate escape sequences like \u5b89 into the corresponding character.

It works when you pass a string constant because the C# compiler interprets the escape sequences and translates them into the corresponding characters at compile time, long before the decoder is called.

You have to write code that recognizes the escape sequences and converts them into the corresponding characters.
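
The difference between the two cases can be seen in a minimal sketch like the one below. The class name EscapeTimingDemo and the helper RoundTrip are hypothetical; RoundTrip mirrors the question's Decoder, and the verbatim literal stands in for the text read from the file.

```csharp
using System;
using System.Text;

public static class EscapeTimingDemo
{
    // Same round trip through UTF-8 as the question's Decoder method.
    public static string RoundTrip(string value)
    {
        Encoding enc = new UTF8Encoding();
        return enc.GetString(enc.GetBytes(value));
    }

    public static void Main()
    {
        // Regular literal: the compiler resolves the escapes, so this holds 4 characters.
        string literal = "\u5b89\u5fbd\u5b5f\u5143";

        // Verbatim literal: backslashes stay as they are, like text read from a file.
        string fromFile = @"\u5b89\u5fbd\u5b5f\u5143";

        Console.WriteLine(literal.Length);   // 4
        Console.WriteLine(fromFile.Length);  // 24

        // The encoding round trip changes neither string.
        Console.WriteLine(RoundTrip(literal) == literal);   // True
        Console.WriteLine(RoundTrip(fromFile) == fromFile); // True
    }
}
```

In other words, the Decoder in the question never decodes anything; the hard-coded call only appears to work because its argument already holds the four characters.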

score 0 · Answer 5 · answered Mar 16 '12 at 13:48

When you are reading "\u5b89\u5fbd\u5b5f\u5143" you get exactly what you read. The debugger escapes your strings before displaying them. The double backslashes in the string are actually single backslashes that have been escaped.

When you pass your hard-coded value, you are not actually passing in what you see on the screen. You are passing in four Unicode characters, since the C# string literal is unescaped by the compiler.

Darin already posted a way to unescape Unicode characters from the file, so I won't repeat it.

score -2 · Answer 6 · edited Jul 09 '21 at 16:51

I think this will give you some idea.

string str = "ivandro\u0020";
str = str.Trim();

If you try to print the string, you will notice that the space, which is \u0020, is removed.

answered Jun 08 '14 at 01:38 by Ivandro Jao · edited by Peter Mortensen
