I have some string that contains the following code/value:

"You won\u0092t find a ...."

It looks like that string contains the Right Apostrophe special character.

I'm not sure how to display this in the web browser. It keeps displaying the tofu (square box) character instead. I'm under the impression that the Unicode code point U+0092 (hex) can be written as the HTML numeric character reference &#146;

Is my understanding correct?


Update 1:

It was suggested by @sam-axe that I HtmlEncode the string. That didn't work. Here is the result:

[screenshot: output after HtmlEncode]

Note the ampersand got correctly encoded....

Pure.Krome

2 Answers

It looks like there's an encoding mix-up. In .NET, strings are encoded as UTF-16, and a right apostrophe (right single quotation mark) should be represented as \u2019. But in your example, the right apostrophe is represented as \x92, which suggests the original encoding was Windows code page 1252, where byte 0x92 is that character. In Unicode, U+0092 is a C1 control character with no visible glyph, so a Unicode document that contains it won't display an apostrophe.

You can fix the problem by re-encoding your string as UTF-16. To do so, treat the string as an array of bytes, and then convert the bytes back to Unicode using the 1252 code page:

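using System.Linq;
using System.Text;
// On .NET Core / .NET 5+, first call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
string title = "You won\u0092t find a cheaper apartment * Sauna & Spa";
// Every char is below U+0100 here, so the cast to byte loses nothing.
byte[] bytes = title.Select(c => (byte)c).ToArray();
title = Encoding.GetEncoding(1252).GetString(bytes);
// Result: "You won’t find a cheaper apartment * Sauna & Spa"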
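
The cast to byte above only works when every character in the string is below U+0100. As a minimal sketch for mixed input, assuming .NET Core or .NET 5+ (on .NET Framework the RegisterProvider line can be dropped), the following re-decodes only the C1 range and leaves everything else alone. The class and method names are hypothetical and not part of any library:

```csharp
using System.Text;

public static class C1Repair
{
    // Hypothetical helper: re-decodes only U+0080..U+009F as Windows-1252
    // bytes and copies every other character through unchanged.
    public static string FixC1AsWindows1252(string input)
    {
        // Makes code page 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        var sb = new StringBuilder(input.Length);
        var oneByte = new byte[1];
        foreach (char c in input)
        {
            if (c >= '\u0080' && c <= '\u009F')
            {
                // Treat the code point as the single byte it was read from.
                oneByte[0] = (byte)c;
                sb.Append(cp1252.GetString(oneByte));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Calling `C1Repair.FixC1AsWindows1252("You won\u0092t")` should turn U+0092 into U+2019 while leaving any characters above U+00FF in the rest of the string intact.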
Michael Liu
  • Good thinking - I didn't consider an encoding mix-up. – xxbbcc Sep 13 '17 at 04:11
  • Interesting point. The _original_ data was actually `&#146;` in some `XML` document. I then used `XDocument.Parse()` to load in this file-text data into an XML document. The file does NOT have the `UTF-8` declaration at the top, though. Could this be related to the issue? – Pure.Krome Sep 13 '17 at 04:18
  • An XML file that lacks an encoding declaration [must](http://www.w3.org/TR/REC-xml/#charencoding) use UTF-8 (or UTF-16 if a UTF-16 byte-order mark is present). So the actual problem is that whoever created this document used the wrong character reference to refer to a quotation mark. If you're unable to get the source of the problem corrected, and if this is the only affected character, then a simpler solution might be to manually replace `"&#146;"` by `"&#8217;"` using String.Replace before you parse the document. – Michael Liu Sep 13 '17 at 13:48
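
The replace-before-parse idea from the last comment could look roughly like this. It is only a sketch: the class and method names are made up for illustration, and it handles just the decimal form `&#146;`, not a hex form such as `&#x92;`.

```csharp
using System.Xml.Linq;

public static class XmlEntityFix
{
    // Hypothetical helper: swaps the Windows-1252-style reference &#146;
    // for &#8217; (right single quotation mark) before the XML is parsed.
    public static XDocument ParseWithFixedApostrophe(string rawXml)
    {
        string repaired = rawXml.Replace("&#146;", "&#8217;");
        return XDocument.Parse(repaired);
    }
}
```

After parsing, the element text contains U+2019 instead of the control character U+0092.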
Note: much of my answer is based on guessing and looking at the decompiled code of System.Web 4.0. The reference source looks very similar (identical?).

You're correct that "&#146;" (6 characters) can be displayed in the browser. Your output string, however, contains "\u0092" (1 character). This is a control character, not an HTML entity.

According to the reference code, WebUtility.HtmlEncode() doesn't transform characters from 128 through 159 (U+0080 to U+009F), all of which are C1 control characters. The ampersand is special-cased in the code, as are a few other special HTML symbols.

My guess is that because these are control characters, they're output without transformation, since transforming them would change the meaning of the string. (I tried running some examples in LINQPad, and this character was not rendered.)

If you really want to transform these characters (or remove them), you'll probably have to write your own function before/after calling HtmlEncode() - there may be something that does this already but I don't know of any.
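
A minimal sketch of such a function, assuming you simply want to drop the C1 range (U+0080 to U+009F) before encoding. The class and method names are hypothetical; only `WebUtility.HtmlEncode` is a real framework call:

```csharp
using System.Linq;
using System.Net;

public static class HtmlSanitizer
{
    // Hypothetical helper: removes C1 control characters, then HTML-encodes
    // whatever is left with the framework's own encoder.
    public static string EncodeWithoutC1Controls(string input)
    {
        string cleaned = new string(
            input.Where(c => c < '\u0080' || c > '\u009F').ToArray());
        return WebUtility.HtmlEncode(cleaned);
    }
}
```

Note that stripping is lossy: "won\u0092t" becomes "wont", so this only makes sense when the original encoding can't be recovered.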

Hope this helps.

Edit: Michael Liu's answer seems correct. I'm leaving my answer here because it may be useful in cases when the input encoding of a string is not known.

xxbbcc
  • I would prefer to remove them. So are you suggesting that I keep characters 20-127 & 160-255? The rest are stripped... – Pure.Krome Sep 13 '17 at 02:57
  • @Pure.Krome Yes, that's probably what I'd do. I'm not entirely sure that that's the "right" solution but HtmlEncode() definitely won't deal with those characters. I guess it's up to the client that displays the string to deal with it - a desktop textbox doesn't render it and the browser puts the boxy character in place. – xxbbcc Sep 13 '17 at 02:59
  • @Pure.Krome Just saw your comment edit - yes, the 160-255 range seems safe for display, although you could just encode all characters over 160 into an HTML number code to be sure. – xxbbcc Sep 13 '17 at 03:02
  • This answer helps with what you are saying @xxbbcc : https://stackoverflow.com/a/14323524/30674 – Pure.Krome Sep 13 '17 at 03:16
  • @Pure.Krome Correct, although I'd probably use a `for` loop instead of a regex. Michael Liu's answer looks correct, though - my answer may be useful when the input encoding is not known. – xxbbcc Sep 13 '17 at 04:14