I have a string of UTF-8 characters that needs to be output to an older system as UTF-7. I have two questions about this:

  1. How can I efficiently convert a string s that contains UTF-8 characters into the same string without those characters?

  2. Is there a simple way of converting extended characters like 'Ō' to their closest non-extended equivalent 'O'?

Mechanical snail
maxfridbe

1 Answer

If the older system can actually handle UTF-7 properly, why do you want to remove anything? Just encode the string as UTF-7:

string text = LoadFromWherever(Encoding.UTF8);
byte[] utf7 = Encoding.UTF7.GetBytes(text);

Then send the UTF-7-encoded text down to the older system.

If you've got the original UTF-8-encoded bytes, you can do this in one step:

byte[] utf7 = Encoding.Convert(Encoding.UTF8, Encoding.UTF7, utf8);
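
As a minimal sketch of that one-step conversion, the following wraps `Encoding.Convert` and decodes the result again to check that nothing was lost. The class and method names (`Utf7RoundTrip`, `Utf8ToUtf7`, `DecodeUtf7`) are made up for illustration. Note that on newer .NET versions `Encoding.UTF7` is marked obsolete, so this may produce a compiler warning there.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class Utf7RoundTrip
{
    // Re-encodes UTF-8 bytes as UTF-7 without building an intermediate string by hand.
    public static byte[] Utf8ToUtf7(byte[] utf8Bytes)
    {
        return Encoding.Convert(Encoding.UTF8, Encoding.UTF7, utf8Bytes);
    }

    // Decodes UTF-7 bytes back to a string, e.g. to verify the conversion.
    public static string DecodeUtf7(byte[] utf7Bytes)
    {
        return Encoding.UTF7.GetString(utf7Bytes);
    }
}
```

Every byte of the UTF-7 output is in the 7-bit range, and decoding it should give back the original text, including characters such as 'Ō'.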

If you actually need to convert to ASCII, you can do this reasonably easily.

To remove the non-ASCII characters:

// An empty replacement string makes the encoder drop anything ASCII cannot represent.
var encoding = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback(""),
    new DecoderReplacementFallback(""));
byte[] ascii = encoding.GetBytes(text);

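To convert non-ASCII characters to their nearest ASCII equivalent:

// FormKD decomposes characters such as 'Ō' into a base letter plus combining marks.
string normalized = text.Normalize(NormalizationForm.FormKD);
// The empty replacement fallback then drops the combining marks, leaving the base letter.
var encoding = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback(""),
    new DecoderReplacementFallback(""));
byte[] ascii = encoding.GetBytes(normalized);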
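
To get a string back rather than a byte array, the normalize-then-encode approach can be wrapped in a small helper. This is only a sketch: the names `AsciiFolding` and `ToClosestAscii` are hypothetical, and it assumes the input is an ordinary .NET string.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class AsciiFolding
{
    public static string ToClosestAscii(string text)
    {
        // FormKD splits 'Ō' into 'O' followed by a combining macron (U+0304).
        string normalized = text.Normalize(NormalizationForm.FormKD);

        // The empty replacement fallback drops every character ASCII cannot
        // represent, which includes the combining marks produced above.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));

        // Encode to ASCII bytes, then decode so the caller gets a string.
        byte[] bytes = ascii.GetBytes(normalized);
        return ascii.GetString(bytes);
    }
}
```

For example, `AsciiFolding.ToClosestAscii("Ōsaka")` should give `"Osaka"`. Characters that have no Unicode decomposition, such as 'ø', are not mapped to a lookalike; they are simply dropped.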
Jon Skeet