34

Say you've loaded a text file into a string, and you'd like to convert all Unicode escapes into actual Unicode characters inside of the string.

Example:

"The following is the top half of an integral character in Unicode '\u2320', and this is the lower half '\U2321'."

Daniel A. White
jr.

5 Answers

50

The answer is simple and works well with strings up to at least several thousand characters.

Example 1:

// Requires: using System.Globalization; using System.Text.RegularExpressions;
Regex rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString() );

Example 2:

Regex rx = new Regex( @"\\[uU]([0-9A-F]{4})" );
result = rx.Replace( result, delegate (Match match) { return ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString(); } );

The first example makes the replacement with a lambda expression (C# 3.0); the second uses an anonymous method written with the delegate keyword, which should work with C# 2.0.

To break down what's going on here, first we create a regular expression:

new Regex( @"\\[uU]([0-9A-F]{4})" );

Then we call Replace() with the string 'result' and a callback (a lambda expression in the first example, an anonymous method in the second; a regular named method would also work) that converts each match of the regular expression found in the string.
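As a minimal sketch of the "regular method" option mentioned above: any method that takes a Match and returns a string can be passed as the callback. The class and method names here (NamedEvaluatorDemo, DecodeEscape, Unescape) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NamedEvaluatorDemo
{
    // Hypothetical named method; its signature matches the MatchEvaluator delegate.
    private static string DecodeEscape(Match match)
    {
        // Same conversion as in the examples above: skip "\u", parse hex, cast to char.
        return ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString();
    }

    // Hypothetical wrapper around the Replace() call.
    public static string Unescape(string input)
    {
        Regex rx = new Regex(@"\\[uU]([0-9A-F]{4})");
        return rx.Replace(input, new MatchEvaluator(DecodeEscape));
    }

    public static void Main()
    {
        // The verbatim string holds literal backslashes, as text read from a file would.
        Console.WriteLine(Unescape(@"\u0048\u0069"));  // prints Hi
    }
}
```

This form can be handy when the conversion grows beyond one line or needs to be reused in several places.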

The Unicode escape is processed like this:

((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString()

First, get the string holding the hex digits of the escape by skipping its first two characters (the backslash and the u or U):

match.Value.Substring(2)

Parse that string with Int32.Parse(), which takes the string and the number format it should expect. In this case the format is a hex number:

NumberStyles.HexNumber

Then we cast the resulting number to a Unicode character:

(char)

Finally, we call ToString() on the Unicode character to get its string representation, which is the value handed back to Replace():

.ToString()
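The steps above can be traced one at a time with intermediate variables. This is only an illustrative sketch; the names EscapeStepsDemo and DecodeOne are assumptions, and the input is the '\u2320' escape from the question.

```csharp
using System;
using System.Globalization;

public static class EscapeStepsDemo
{
    // Hypothetical helper: decodes a single matched escape such as \u2320.
    public static string DecodeOne(string escape)
    {
        string hexDigits = escape.Substring(2);                         // "2320"
        int codeUnit = Int32.Parse(hexDigits, NumberStyles.HexNumber);  // 0x2320 = 8992
        char character = (char) codeUnit;                               // U+2320
        return character.ToString();                                    // one-character string
    }

    public static void Main()
    {
        string decoded = DecodeOne(@"\u2320");
        // Print the numeric value to avoid depending on the console encoding.
        Console.WriteLine((int) decoded[0]);  // prints 8992
    }
}
```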

Note: Instead of grabbing the text to be converted with a Substring call, you could use the match parameter's GroupCollection and the subexpression in the regular expression to capture just the number ('2320'), but that's more complicated and less readable.
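For comparison, the capture-group alternative from the note might look roughly like this. It is a sketch under the same assumptions as the examples above (four uppercase hex digits); the names CaptureGroupDemo and Unescape are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CaptureGroupDemo
{
    // Hypothetical helper that reads the digits from capture group 1
    // instead of calling Substring(2) on the whole match.
    public static string Unescape(string input)
    {
        Regex rx = new Regex(@"\\[uU]([0-9A-F]{4})");
        return rx.Replace(input, match =>
            ((char) Int32.Parse(match.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    public static void Main()
    {
        Console.WriteLine(Unescape(@"\u0048\U0069"));  // prints Hi
    }
}
```

Group 0 is the whole match (for example \u2320), while group 1 is whatever the parentheses in the pattern captured (2320).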

Peter Mortensen
jr.
  • 2
    \u and \U should be treated differently -- \u specifies 4 hex digits (16 bits), where \U specifies 8 (32 bits) -- a unicode codepoint is 21 bits long. Also, you should use the char.ConvertFromUtf32() method rather than a cast. – Alex Lyman Oct 08 '08 at 22:18
  • I've seen \u and \U documented both ways though the current C# language specification indicates 4 hex bytes for \u and 8 hex bytes for \U. In any case, \U with only 4 hex digits is processed correctly. Have to check if ConvertFromUtf32() is functionally different from a cast. – jr. Oct 15 '08 at 07:07
  • Yeah, I read the ignorecase option in the second part of the post after realising myself. Thanks all the same. :) – Echilon Apr 08 '09 at 09:14
  • 2
    This is a brilliant answer! Just one point, in my case a-f letters were lower case so this is maybe more accurate: var rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})"); – Mo Valipour Jun 06 '11 at 23:39
  • The first part should say result = rx.Replace(... rather than result = rxx.Replace(.... I'd fix it myself but stackoverflow does not allow edits of fewer than 6 characters because the stackexchange higher ups think they know better than professionals actually using the site. – KatDevsGames Feb 21 '15 at 20:53
10

Refactored a little more:

Regex regex = new Regex (@"\\U([0-9A-F]{4})", RegexOptions.IgnoreCase);
string line = "...";
line = regex.Replace (line, match => ((char)int.Parse (match.Groups[1].Value,
  NumberStyles.HexNumber)).ToString ());
George Tsiokos
8

This is the VB.NET equivalent:

Dim rx As New RegularExpressions.Regex("\\[uU]([0-9A-Fa-f]{4})")
result = rx.Replace(result, Function(match) CChar(ChrW(Int32.Parse(match.Value.Substring(2), Globalization.NumberStyles.HexNumber))).ToString())
Peter Mortensen
Tarık Özgün Güner
2

Add a UnicodeExtensions.cs class to your project:

public static class UnicodeExtensions
{
    private static readonly Regex Regex = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");

    public static string UnescapeUnicode(this string str)
    {
        return Regex.Replace(str,
            match => ((char) int.Parse(match.Value.Substring(2),
                NumberStyles.HexNumber)).ToString());
    }
}

Usage:

var test = "\\u0074\\u0068\\u0069\\u0073 \\u0069\\u0073 \\u0074\\u0065\\u0073\\u0074\\u002e";
var output = test.UnescapeUnicode();   // output is => this is test.
Darzi
1

I think you'd better add the lowercase letters to your regular expression. It worked better for me.

Regex rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");
result = rx.Replace(result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString());
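To illustrate why the lowercase letters matter, here is a small sketch that checks an escape written with lowercase hex digits against both patterns, plus the RegexOptions.IgnoreCase variant. The sample input and the class name HexCaseDemo are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexCaseDemo
{
    public static void Main()
    {
        // Verbatim string: a literal backslash followed by u00e9 (lowercase 'e').
        string input = @"caf\u00e9";

        // Uppercase-only character class: the lowercase 'e' is not accepted.
        Console.WriteLine(Regex.IsMatch(input, @"\\[uU]([0-9A-F]{4})"));     // prints False

        // Character class extended with a-f.
        Console.WriteLine(Regex.IsMatch(input, @"\\[uU]([0-9A-Fa-f]{4})"));  // prints True

        // Uppercase-only class, but matched case-insensitively.
        Console.WriteLine(Regex.IsMatch(input, @"\\[uU]([0-9A-F]{4})", RegexOptions.IgnoreCase));  // prints True
    }
}
```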
Peter Mortensen
Baseem Najjar