How do I convert Unicode escape sequences to Unicode characters in a .NET string?

Question

Say you've loaded a text file into a string, and you'd like to convert all Unicode escapes into actual Unicode characters inside of the string.

Example:

"The following is the top half of an integral character in Unicode '\u2320', and this is the lower half '\U2321'."

score 50 · Accepted Answer · edited Jul 01 '17 at 14:07

The answer is simple and works well with strings up to at least several thousand characters.

Example 1:

// Needs: using System; using System.Globalization; using System.Text.RegularExpressions;
Regex rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString() );

Example 2:

Regex rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, delegate (Match match) { return ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString(); } );

The first example shows the replacement being made using a lambda expression (C# 3.0) and the second uses a delegate which should work with C# 2.0.

To break down what's going on here, first we create a regular expression:

new Regex( @"\\[uU]([0-9A-F]{4})" );

Then we call Replace() with the string 'result' and an anonymous function (a lambda expression in the first example, an anonymous delegate in the second; a regular named method would also work) that converts each match found in the string.

The Unicode escape is processed like this:

((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString()

First, get the hex digits of the escape by skipping its first two characters (the backslash and the u or U):

match.Value.Substring(2)

Parse that string with Int32.Parse(), which takes the string and the number format to expect, in this case a hexadecimal number:

NumberStyles.HexNumber

Then we cast the resulting number to a Unicode character:

(char)

Finally, we call ToString() on the character to get its string representation, which is the value passed back to Replace():

.ToString()
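
To make the steps above concrete, here is a small sketch that applies them one at a time to a single escape. The names EscapeSteps and ToCharacter are made up for illustration and are not part of the answer's code, and the input is assumed to be one well-formed escape such as \u2320.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class EscapeSteps
{
    // Assumes 'escape' is a backslash, then u or U, then 4 hex digits.
    public static string ToCharacter(string escape)
    {
        // Skip the backslash and the u/U; for \u2320 this leaves "2320".
        string hex = escape.Substring(2);

        // Parse the digits as hexadecimal; "2320" becomes 8992.
        int code = Int32.Parse(hex, NumberStyles.HexNumber);

        // Cast the number to a UTF-16 code unit; 8992 is U+2320.
        char ch = (char) code;

        // Replace() expects a string back, so convert the char.
        return ch.ToString();
    }
}
```

Inside the Replace() callback, match.Value plays the role of the escape parameter here.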

Note: Instead of grabbing the text to be converted with a Substring call, you could add a subexpression to the regular expression and use the match parameter's GroupCollection to capture just the number ('2320'), but that's more complicated and less readable.

\u and \U should be treated differently -- \u specifies 4 hex digits (16 bits), where \U specifies 8 (32 bits) -- a unicode codepoint is 21 bits long. Also, you should use the char.ConvertFromUtf32() method rather than a cast. — Alex Lyman, Oct 08 '08 at 22:18
I've seen \u and \U documented both ways though the current C# language specification indicates 4 hex bytes for \u and 8 hex bytes for \U. In any case, \U with only 4 hex digits is processed correctly. Have to check if ConvertFromUtf32() is functionally different from a cast. — jr., Oct 15 '08 at 07:07
Yeah, I read the ignorecase option in the second part of the post after realising myself. Thanks all the same. :) — Echilon, Apr 08 '09 at 09:14
This is a brilliant answer! Just one point, in my case a-f letters were lower case so this is maybe more accurate: var rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})"); — Mo Valipour, Jun 06 '11 at 23:39
The first part should say result = rx.Replace(... rather than result = rxx.Replace(.... I'd fix it myself but stackoverflow does not allow edits of fewer than 6 characters because the stackexchange higher ups think they know better than professionals actually using the site. — KatDevsGames, Feb 21 '15 at 20:53
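
Picking up on the comments above, here is a rough sketch (not the accepted answer's code) that treats the two forms separately. It assumes the C# convention of 4 hex digits after \u and 8 after \U. The names UnicodeEscapes, Unescape, and the group names wide and narrow are hypothetical. Under this assumption, a 4-digit \U escape like the one in the question is left unchanged.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class UnicodeEscapes
{
    // \U followed by 8 hex digits, or \u followed by 4 hex digits.
    private static readonly Regex Escape = new Regex(
        @"\\U(?<wide>[0-9A-Fa-f]{8})|\\u(?<narrow>[0-9A-Fa-f]{4})");

    public static string Unescape(string input)
    {
        return Escape.Replace(input, match =>
        {
            if (match.Groups["wide"].Success)
            {
                uint codePoint = uint.Parse(match.Groups["wide"].Value, NumberStyles.HexNumber);

                // char.ConvertFromUtf32 throws for surrogate values and for
                // anything above 0x10FFFF, so such escapes are left as they are.
                bool valid = codePoint <= 0x10FFFF
                    && (codePoint < 0xD800 || codePoint > 0xDFFF);
                return valid ? char.ConvertFromUtf32((int) codePoint) : match.Value;
            }

            // 4-digit form: one UTF-16 code unit, so a plain cast is enough.
            int codeUnit = int.Parse(match.Groups["narrow"].Value, NumberStyles.HexNumber);
            return ((char) codeUnit).ToString();
        });
    }
}
```

The 4-digit branch keeps the plain cast so that a surrogate pair written as two consecutive \u escapes still ends up as one character in the resulting string.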

score 10 · Answer 2 · edited Aug 28 '14 at 17:50

Refactored a little more:

Regex regex = new Regex (@"\\U([0-9A-F]{4})", RegexOptions.IgnoreCase);
string line = "...";
line = regex.Replace (line, match => ((char)int.Parse (match.Groups[1].Value,
  NumberStyles.HexNumber)).ToString ());

answered Jan 20 '09 at 18:54 by George Tsiokos

score 8 · Answer 3 · edited Jul 01 '17 at 14:09

This is the VB.NET equivalent:

Dim rx As New RegularExpressions.Regex("\\[uU]([0-9A-Fa-f]{4})")
result = rx.Replace(result, Function(match) CChar(ChrW(Int32.Parse(match.Value.Substring(2), Globalization.NumberStyles.HexNumber))).ToString())

edited by Peter Mortensen · answered Oct 30 '12 at 15:36 by Tarık Özgün Güner

score 2 · Answer 4 · answered Dec 02 '21 at 10:22

Add a UnicodeExtensions.cs class to your project:

public static class UnicodeExtensions
{
    private static readonly Regex Regex = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");

    public static string UnescapeUnicode(this string str)
    {
        return Regex.Replace(str,
            match => ((char) int.Parse(match.Value.Substring(2),
                NumberStyles.HexNumber)).ToString());
    }
}

Usage:

var test = "\\u0074\\u0068\\u0069\\u0073 \\u0069\\u0073 \\u0074\\u0065\\u0073\\u0074\\u002e";
var output = test.UnescapeUnicode();   // output is => this is test.

score 1 · Answer 5 · edited Jul 01 '17 at 14:08

I think you had better add lowercase letters to your regular expression. It worked better for me.

Regex rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");
result = rx.Replace(result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString());

edited by Peter Mortensen · answered Jul 04 '12 at 14:25 by Baseem Najjar
