HTML Encode ISO-8859-2 (Latin-2) characters in C#

Question

Does anyone know how to HTML-encode ISO-8859-2 characters in C#? The following example does not work:

        String name = "Filipović";
        String encoded = WebUtility.HtmlEncode(name);

The resulting string should be

"Filipovi&#263;"

Thanks

Just gotta ask... why? Are you serving documents using ISO-8859-2 to clients? If you serve them in UTF-8 instead, you shouldn't have to worry about html encoding characters like ć. — Culme, Jun 29 '17 at 16:22
When storing user inputs into the DB, characters like 'ć' get stored as 'c' without encoding — , Jun 29 '17 at 16:24
I know I'm probably not helpful now, but it sounds like you should really try to change the data type of the database column in which you store the name. Do you have any option to change the database? — Culme, Jun 29 '17 at 16:26
It's a cloud service with completely generic databases, so no .. Most clients store Latin-1, but some may store Latin-2 or even Chinese. No way to know in advance, so html encoding is required — , Jun 29 '17 at 16:29
Oh, that sounds cumbersome. I am kind of lost, since the code you provided does in fact work, only not for that specific character. I tried it with characters like å, ä and ö - which all get encoded into html. Sorry, I can't figure out a solution right now. — Culme, Jun 29 '17 at 16:39
@Culme thanks a lot for checking this out anyway, I appreciate :) — , Jun 29 '17 at 16:48
Do note in databases, using this gives very unpredictable length constraints on the database fields. I'd advise setting your database text encoding to UTF-8, where such special characters will only take 2 or 3 bytes max, instead of 6. — Nyerguds, Jul 03 '17 at 08:02
It's very much true, but all fields are VARCHAR(MAX) so there is no impact in my case (size of the database is not a concern). — , Jul 03 '17 at 08:41

György Kőszeg · Answer 1 · 2017-06-29T17:07:40.253

After reading your comments (you also need to support Chinese names using only ASCII characters), I don't think you should stick to the ISO-8859-2 encoding.

Solution 1

Use UTF-7 encoding for such names. UTF-7 is designed to use only ASCII characters for any Unicode string.

// Note: Encoding.UTF7 is marked obsolete (SYSLIB0001) in .NET 5 and later.
string value = "Filipović with Unicode symbol: 🏯";
var encoded = Encoding.ASCII.GetString(Encoding.UTF7.GetBytes(value));
Console.WriteLine(encoded); // Filipovi+AQc- with Unicode symbol: +2Dzf7w-
var decoded = Encoding.UTF7.GetString(Encoding.ASCII.GetBytes(encoded));

Solution 2

Alternatively, you can use Base64 encoding. In this case, however, even pure ASCII strings are no longer human-readable.

string value = "Filipović with Unicode symbol: 🏯";
var encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
Console.WriteLine(encoded); // RmlsaXBvdmnEhyB3aXRoIFVuaWNvZGUgc3ltYm9sOiDwn4+v
var decoded = Encoding.UTF8.GetString(Convert.FromBase64String(encoded));

Solution 3

If you really want to stick to HTML entity encoding, you can do it like this:

string value = "Filipović with Unicode symbol: 🏯";

var result = new StringBuilder();       
for (int i = 0; i < value.Length; i++)
{
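    // Only combine a complete surrogate pair; a lone high surrogate at the end would otherwise throw.
    if (Char.IsHighSurrogate(value[i]) && i + 1 < value.Length && Char.IsLowSurrogate(value[i + 1]))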
    {
        result.Append($"&#{Char.ConvertToUtf32(value[i], value[i + 1])};");
        i++;
    }
    else if (value[i] > 127)
        result.Append($"&#{(int)value[i]};");
    else
        result.Append(value[i]);
}

Console.WriteLine(result); // Filipovi&#263; with Unicode symbol: &#127983;

Hello @taffer, thanks. So basically you are saying that you can encode ANY character in any language using UTF-7? There is some work to do to update existing databases, but it seems very interesting. — , Jun 29 '17 at 17:01
Yes, you can. UTF-7 supports the full Unicode range, like every other UTF encoding. — György Kőszeg, Jun 29 '17 at 17:02

score 2 · Accepted Answer · answered Jun 29 '17 at 17:02

If you don't have a strict requirement for HTML encoding, I'd recommend URL (percent) encoding, which encodes all non-ASCII characters:

String name = "Filipović";
String encoded = WebUtility.UrlEncode(name); // Filipovi%C4%87
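For reading the values back, a minimal round-trip sketch might look like the following. The class name `UrlEncodingDemo` and the sample name are only illustrative; `WebUtility.UrlEncode` and `WebUtility.UrlDecode` are the real framework calls. Note that `WebUtility.UrlEncode` writes a space as `+`, so the stored value is not plain text with a few escapes.

```csharp
using System;
using System.Net;

public static class UrlEncodingDemo
{
    // Percent-encodes the UTF-8 bytes of every non-ASCII character.
    public static string Encode(string value) => WebUtility.UrlEncode(value);

    // Reverses Encode, turning '+' back into a space and %XX back into bytes.
    public static string Decode(string value) => WebUtility.UrlDecode(value);

    public static void Main()
    {
        string encoded = Encode("Ana Filipović");
        Console.WriteLine(encoded);         // Ana+Filipovi%C4%87
        Console.WriteLine(Decode(encoded)); // Ana Filipović
    }
}
```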

If you must have a string in which all non-ASCII characters are HTML-encoded consistently, your best bet is to use the &#xNNNN; or &#NNNN; format to encode all characters above 127. Unfortunately, there is no way to convince HtmlEncode to encode all characters, so you need to do it yourself, e.g. similarly to how it is done in "Convert a Unicode string to an escaped ASCII string". You can continue using HtmlDecode to read the values back, as it handles &#xNNNN; just fine.

Non-optimal sample:

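  // Requires: using System.Linq;
  // Note: characters outside the BMP are emitted as two separate surrogate code units here.
  var name = "Filipović";
  var result = String.Join("", 
     name.Select(x => x <= 127 ? x.ToString() : String.Format("&#x{0:X4};", (int)x))
  );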
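The approach described in this answer (numeric character references for everything above 127, read back with HtmlDecode) could also be written as a small helper that combines surrogate pairs into a single code point. This is only a sketch: `AsciiHtmlEncoder` and `EncodeNonAscii` are hypothetical names, not framework APIs, and the helper deliberately leaves ASCII characters such as `&` and `<` untouched.

```csharp
using System;
using System.Net;
using System.Text;

public static class AsciiHtmlEncoder
{
    // Hypothetical helper: replaces every code point above 127 with a
    // hexadecimal numeric character reference and copies ASCII unchanged.
    public static string EncodeNonAscii(string value)
    {
        var result = new StringBuilder(value.Length);
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];
            if (c <= 127)
            {
                result.Append(c);
            }
            else if (char.IsHighSurrogate(c) && i + 1 < value.Length
                     && char.IsLowSurrogate(value[i + 1]))
            {
                // Combine the surrogate pair into one code point (e.g. U+1F3EF).
                int codePoint = char.ConvertToUtf32(c, value[i + 1]);
                result.Append("&#x").Append(codePoint.ToString("X")).Append(';');
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character (or an unpaired surrogate, emitted as-is).
                result.Append("&#x").Append(((int)c).ToString("X4")).Append(';');
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        string encoded = EncodeNonAscii("Filipović");
        Console.WriteLine(encoded);                        // Filipovi&#x0107;
        Console.WriteLine(WebUtility.HtmlDecode(encoded)); // Filipović
    }
}
```

Because ASCII is copied verbatim, input that already contains text such as `&amp;` or `&#x0107;` will come back changed after HtmlDecode. If that matters, the `&` would need to be escaped as well before storing.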

Hello @Alexei, thanks for the tip. It looks very interesting, I will try that out. — , Jun 29 '17 at 17:14
Your "non-optimal" sample works great: all letters are indeed encoded, and it is easy to decode by just using HtmlDecode. — , Jun 30 '17 at 10:15
Excellent answer! In reference to the post, and to my problem, the format should be: "{0:G};". — Gerard Jaryczewski, Apr 05 '23 at 15:22
