4

Some UTF-8 characters, such as the byte sequence C2 96 (hyphen), are not displayed correctly. The browser shows a box containing 00 96 instead of '-' (hyphen). Is there a reason for this behavior? How do we correct it?

http://stuffofinterest.com/misc/utf8.php?s=128 (refer to this URL for the codes)

I found that this can be handled with HTML entities. Is there any way to display the character without converting it to an HTML entity?

Krishna

3 Answers

6

The character you're talking about is an en-dash, not a hyphen. Its Unicode code point is U+2013, and its UTF-8 encoding is E2 80 93, not C2 96. The table you linked to is incorrect. The first two columns have nothing to do with UCS-2 or Unicode; they actually contain the windows-1252 encodings for the characters in question. The columns labeled "UTF-8 Hex" and "UTF-8 Native" are just plain wrong, at least for the rows labeled 128 to 159. The entities `&#150;` and `&#x96;` represent an en-dash, but the UTF-8 sequence C2 96 represents a non-displayable control character.
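The byte-level claims above can be checked with a minimal Python sketch. The variable names are illustrative only, and `cp1252` is Python's name for the windows-1252 codec.

```python
# U+2013 EN DASH encodes to three bytes in UTF-8.
en_dash = "\u2013"
en_dash_utf8 = en_dash.encode("utf-8")

# C2 96 is valid UTF-8, but it decodes to U+0096, a C1 control character.
c1_control = bytes.fromhex("c296").decode("utf-8")

# The single byte 0x96 means en-dash only in windows-1252.
from_cp1252 = bytes.fromhex("96").decode("cp1252")

print(en_dash_utf8.hex(" "))       # e2 80 93
print(f"U+{ord(c1_control):04X}")  # U+0096
print(from_cp1252 == en_dash)      # True
```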

You shouldn't need to encode those characters manually anyway. Just tell your text editor (or whatever you use to create the content) to save the file as UTF-8.

Alan Moore
  • I acknowledge that it is not a hyphen. But it is definitely a UTF-8 character. As suggested, http://unicode.org/charts/PDF/U0080.pdf indicates that the character is "Start of Guarded Area". It displays as a hyphen when written as an HTML entity (`&#150;`). – Krishna Sep 09 '09 at 11:16
  • No, the entity `&#150;` does represent an en-dash. It's based on windows-1252 and is therefore technically incorrect, but browsers support it for historical reasons. The correct numerical entity for en-dash, based on its Unicode code point, is `&#8211;`, or `&#x2013;` in hex. – Alan Moore Sep 09 '09 at 13:09
  • Alan, I'm completely unnerved by your comment "technically incorrect, but browsers support it for historical reasons." How many bad mappings are there in the numerical codes for HTML entities? What if I wanted to start a, uh, guarded area in an HTML... well, never mind. But, I would appreciate it if you could point me to a list of these things. If you can assert that there is a list, I'll open a question to ask where it is. – Ion Freeman Dec 09 '13 at 15:20
  • I don't know if there's an exhaustive list, but have a look at [this table](http://www.fileformat.info/info/unicode/block/latin_supplement/list.htm). All of the characters in the `U+0080..U+009F` range are described as control characters, so the "Browser" column should be blank for those rows. Instead you see displayable characters like `ƒ` and `‰`, even though the page is served as UTF-8. If you view the page source, you'll see it's because the characters are written in the form of numeric entities (`&#131;`, `&#137;`). – Alan Moore Dec 10 '13 at 00:37
5

I suspect this is because the characters between U+0080 and U+009F inclusive are control characters. I'm still slightly surprised that they show differently when encoded directly in the HTML than using entities, but basically you shouldn't be using them to start with. U+0096 isn't really "hyphen", it's "start of guarded area".

See the U+0080-U+00FF code chart for more information. Basically, try to avoid control characters...
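One way to confirm how Unicode classifies these code points is the standard `unicodedata` module. In this sketch the helper `describe` is a hypothetical name introduced only for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, name) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, "<no name>"),  # control characters have no name
    )

print(describe("\u0096"))  # ('U+0096', 'Cc', '<no name>')
print(describe("\u2013"))  # ('U+2013', 'Pd', 'EN DASH')
print(describe("-"))       # ('U+002D', 'Pd', 'HYPHEN-MINUS')
```

Category `Cc` marks a control character, while `Pd` is dash punctuation.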

Jon Skeet
  • Thanks a lot. If a program encounters this, how should we handle it? I have tried this in Gmail; it does not display the . It displays the "start of guarded area" as '–'. Any ideas? – Krishna Sep 09 '09 at 11:05
  • How you want to handle this will depend on the application. You may want to strip the characters, or replace them with another Unicode character with similar display characteristics (e.g. use the proper hyphen character). – Jon Skeet Sep 09 '09 at 11:09
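The two options mentioned here, stripping or replacing, could be sketched as follows. This assumes the stray C1 characters originated as windows-1252 bytes, which may not hold for every input. The function name `repair_c1` and its `strip_unmapped` parameter are hypothetical.

```python
import re

# Matches any character in the C1 control range U+0080..U+009F.
C1_CONTROLS = re.compile("[\u0080-\u009f]")

def repair_c1(text: str, strip_unmapped: bool = True) -> str:
    """Reinterpret C1 controls as windows-1252 bytes where a mapping exists."""
    def fix(match):
        ch = match.group()
        try:
            return bytes([ord(ch)]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (0x81, for example) are undefined in windows-1252.
            return "" if strip_unmapped else ch
    return C1_CONTROLS.sub(fix, text)

print(repair_c1("10\u009620") == "10\u201320")  # True
```

Whether to strip or keep the unmapped characters depends on the application, so the sketch leaves that as a flag.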
1

Two reasons come to mind:

  1. Are you sure that you have output the correct character code to the browser? Better check in some hex viewer.
  2. The font you are using doesn't have a glyph defined at this code point.
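
If no hex viewer is at hand, the first check can be approximated in a few lines of Python. The helpers `hex_dump` and `find_c1_controls` are hypothetical names, and the sketch assumes the bytes sent to the browser are available as a `bytes` object.

```python
def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated hex pairs."""
    return data.hex(" ")

def find_c1_controls(data: bytes, encoding: str = "utf-8"):
    """List (index, code point) for each C1 control character in the decoded text."""
    text = data.decode(encoding)
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F
    ]

payload = b"a\xc2\x96b"
print(hex_dump(payload))          # 61 c2 96 62
print(find_c1_controls(payload))  # [(1, 'U+0096')]
```
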
Vilx-