I need to convert the non-alphanumeric characters in a string to their Unicode values, while preserving the alphanumeric characters. Is there a method to do this in C#?

As an example, I need to convert this string:

"hello world!"

To this:

"hello_x0020_world_x0021_"

Michał Turczyn
Tyler Jones

2 Answers

To get a string that is safe to use as an XML node name, you should use XmlConvert.EncodeName.

Note that if you need to encode all non-alphanumeric characters, you will have to write the encoder yourself, because "_" is not encoded by that method.
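
As a minimal sketch of the XmlConvert approach (the wrapper class and method names here are made up for illustration; only XmlConvert.EncodeName comes from the framework):

```csharp
using System;
using System.Xml;

public static class XmlNameDemo
{
    // Hypothetical helper: delegates to the framework's XML name encoder.
    // Characters that are not valid in an XML name are replaced with _xHHHH_.
    public static string ToXmlName(string text)
    {
        return XmlConvert.EncodeName(text);
    }
}
```

With this, XmlNameDemo.ToXmlName("hello world!") should return "hello_x0020_world_x0021_", while a plain underscore as in "a_b" is left unchanged.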

Alexei Levenkov

You could start with this code, which uses the LINQ Select extension method:

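  // requires: using System; using System.Linq;
  string str = "hello world!";
  // Allowed characters: ASCII letters only (digits are not included yet)
  string a = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  a += a.ToLower();

  char[] alphabet = a.ToCharArray();

  // Keep letters as they are; replace everything else with _xhhhh_.
  // The cast to int is needed: a char ignores the x4 format specifier.
  str = string.Join("",
    str.Select(ch => alphabet.Contains(ch) ?
         ch.ToString() : String.Format("_x{0:x4}_", (int)ch)).ToArray()
  );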

Now, clearly it has some problems:

  • it does a linear search through the list of characters for every input character
  • it misses digits
  • once digits are added, we need to decide whether the first character may be a digit (assuming yes)
  • it creates a large number of strings that are immediately discarded (one per character)
  • "alphanumeric" is limited to ASCII (assuming that is OK; if not, Char.IsLetterOrDigit can help)
  • it does too much work for purely alphanumeric strings

The first two are easy: we can use a HashSet (O(1) Contains) initialized with the full list of allowed characters. (If any alphanumeric character is acceptable, not just ASCII, it is more readable to use the existing method Char.IsLetterOrDigit.)

public static HashSet<char> asciiAlphaNum = new HashSet<char>
       ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
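
To make the difference between the two checks concrete, here is a small sketch (the class and method names are hypothetical, chosen only for this comparison):

```csharp
using System;
using System.Collections.Generic;

public static class AlphaNumChecks
{
    private static readonly HashSet<char> AsciiAlphaNum = new HashSet<char>(
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");

    // True only for the 62 ASCII letters and digits listed above.
    public static bool IsAsciiAlphaNum(char ch)
    {
        return AsciiAlphaNum.Contains(ch);
    }

    // True for any Unicode letter or decimal digit, e.g. 'é' as well as 'e'.
    public static bool IsUnicodeAlphaNum(char ch)
    {
        return char.IsLetterOrDigit(ch);
    }
}
```

Both checks accept 'a' and '7' and reject ' ', but only the second accepts 'é', so the choice decides whether accented letters end up escaped.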

To avoid ch.ToString(), which pointlessly produces strings that are immediately garbage collected, we need to figure out how to construct a string from a mix of char and string values. String.Join does not help because it wants strings to start with, and the regular new string(...) constructors have no overload for a mix of char and string. So we are left with StringBuilder, which happily takes both in Append. Consider giving it an initial capacity of str.Length if most strings contain no other characters.

So for each character we just need to call either builder.Append(ch) or builder.AppendFormat("_x{0:x4}_", (int)ch). For the iteration it is easier to use a regular foreach, but if one really wants LINQ, Enumerable.Aggregate is the way to go.

string ReplaceNonAlphaNum(string str)
{
   var builder = new StringBuilder(); 
   foreach (var ch in str)
   {
       if (asciiAlphaNum.Contains(ch))
             builder.Append(ch);
       else
             builder.AppendFormat("_x{0:x4}_", (int)ch);
   }
   return builder.ToString();    
}

string ReplaceNonAlphaNumLinq(string str)
{
   return str.Aggregate(new StringBuilder(), (builder, ch) => 
       asciiAlphaNum.Contains(ch) ? 
          builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)           
   ).ToString();
}

As for the last point: if there is nothing to convert, we don't need to do anything at all, so a check like the one in "check alphanumeric characters in string in c#" would help avoid creating extra strings.

Thus the final version (LINQ, as it is a bit shorter and fancier):

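// \z instead of $ so that a string ending in a newline is not treated as purely alphanumeric
private static readonly Regex asciiAlphaNumRx = new Regex(@"^[a-zA-Z0-9]*\z");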
public static HashSet<char> asciiAlphaNum = new HashSet<char>
       ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");

string ReplaceNonAlphaNumLinq(string str)
{
   return asciiAlphaNumRx.IsMatch(str) ? str :
       str.Aggregate(new StringBuilder(), (builder, ch) => 
          asciiAlphaNum.Contains(ch) ? 
             builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)            
       ).ToString();
}
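
The same fast path can also be sketched without a regular expression, by reusing the HashSet for the pre-check. This is only one possible variant; the class name and the extra capacity of 16 are arbitrary choices for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class NameEscaper
{
    private static readonly HashSet<char> AsciiAlphaNum = new HashSet<char>(
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");

    public static string ReplaceNonAlphaNum(string str)
    {
        // Fast path: nothing to escape, so return the original instance.
        if (str.All(ch => AsciiAlphaNum.Contains(ch)))
            return str;

        // Escaped output is longer than the input, so reserve a little extra room.
        var builder = new StringBuilder(str.Length + 16);
        foreach (char ch in str)
        {
            if (AsciiAlphaNum.Contains(ch))
                builder.Append(ch);
            else
                builder.AppendFormat("_x{0:x4}_", (int)ch);
        }
        return builder.ToString();
    }
}
```

This walks the string twice when escaping is needed, which is the same trade-off the regex pre-check makes.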

Alternatively, the whole thing could be done with Regex; see "Regex replace: Transform pattern with a custom function" for a starting point.
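
A rough sketch of that Regex route, assuming the same ASCII-only definition of alphanumeric as above (the class name is hypothetical; Regex.Replace with a match callback is the standard framework API):

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexEscaper
{
    // Matches exactly one character that is not an ASCII letter or digit.
    private static readonly Regex NonAlphaNum = new Regex("[^a-zA-Z0-9]");

    public static string ReplaceNonAlphaNum(string str)
    {
        // The callback runs once per match and returns the replacement text.
        return NonAlphaNum.Replace(str,
            m => string.Format("_x{0:x4}_", (int)m.Value[0]));
    }
}
```

Like the loop-based versions, this escapes each UTF-16 code unit separately, so a character outside the Basic Multilingual Plane would come out as two escapes.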

Alexei Levenkov
Michał Turczyn
  • Looks very half-done... should use `new HashSet("az09AZ")` and `String.Format("_x{0:x4}_", ch)`... or maybe use Regex replace with callback... – Alexei Levenkov Mar 10 '19 at 22:36
  • @AlexeiLevenkov Thanks for remarks. But I don't understand why it is half-way done. It gives expected result. – Michał Turczyn Mar 11 '19 at 07:36
  • I've made a minor edit to the post showing what I mean, since it did not fit into a comment. Feel free to revert. If you are fine with the change, I think it now covers all the OP may be interested in and deserves an upvote :) – Alexei Levenkov Mar 12 '19 at 06:40