Using HTML Symbol Entities instead of the actual symbol

Question

Is there any particular reason I should use HTML symbol entities instead of the actual symbol (I mean the one which I can just type)? For example the symbol /; the HTML entity code for it is &#47.

Should I use the symbol's code or the symbol itself in my HTML code, and why?

You must use the code when you use special chars that can be misinterpreted when the encoding changes (like Ç, Ã and others), or when you don't want the char to be interpreted, for example if you want to actually show `<br>` and not break a line. — Lefsler, May 16 '13 at 18:20
possible duplicate of [When Should One Use HTML Entities](http://stackoverflow.com/questions/436615/when-should-one-use-html-entities) — Jukka K. Korpela, May 16 '13 at 19:30
This question has been asked several times in slightly varying formulations. Note that `&#47` is not an entity but a character reference and should (and in many contexts *shall*) contain the trailing semicolon. It is very difficult to imagine a context where you would need or want to use a reference for `/` – even if your key for it is broken, you can normally enter it in some way. — Jukka K. Korpela, May 16 '13 at 19:32

Joseph Myers · Answer 1 · 2013-05-17T17:48:42.343

Using an HTML entity reference allows the entity to be represented as intended regardless of the encoding applied to the document. That is the benefit.
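
As a rough illustration of that benefit (a Python sketch, not part of the original answer; the sample string and the helper name `to_ascii_html` are made up for the example), text can be reduced to pure ASCII with numeric character references and then recovered identically whichever ASCII-compatible encoding the bytes are decoded with:

```python
import html

def to_ascii_html(text: str) -> bytes:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace")

sample = "Ça coûte 5 €"            # hypothetical sample text
ascii_bytes = to_ascii_html(sample)

# Decode the same bytes under three different ASCII-compatible labels,
# then let an HTML-aware step resolve the references.
decoded = {
    enc: html.unescape(ascii_bytes.decode(enc))
    for enc in ("utf-8", "iso-8859-1", "cp1252")
}
```

Note that `xmlcharrefreplace` only handles non-ASCII characters; it does not escape `<` or `&`, so it is not a complete HTML escaper on its own.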

Rather than strictly using entities for all non-US-ASCII characters, feel free to use an encoding for your document that supports the document's target language, preferably one also supporting other languages, like UTF-8.

However, please avoid using any system-specific encoding, especially regular Windows encoding. It is often the case that Windows-1252 text is sent to other systems with the wrong label of ISO-8859-1.
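
A small Python sketch of how that mislabeling goes wrong (the sample text is an assumption chosen for illustration): Windows-1252 puts printable characters such as curly quotes and the euro sign in the byte range 0x80 to 0x9F, where ISO-8859-1 has only control characters.

```python
original = "“quoted” €"                 # hypothetical text typed on Windows
raw = original.encode("cp1252")         # bytes as saved on disk

# A receiver that trusts the ISO-8859-1 label gets C1 control characters
# instead of the punctuation the author typed.
mislabeled = raw.decode("iso-8859-1")

# Decoding with the encoding that was actually used recovers the text.
recovered = raw.decode("cp1252")
```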

In the past there has certainly been less reliable support for numeric HTML entities than for named HTML entities (based on my own firsthand observation), but in theory a numeric HTML entity is still character encoding independent and "safe" because the numeric value refers directly to a code point registered in the UCS (http://en.wikipedia.org/wiki/Universal_Character_Set) and equivalent to its defined character name.
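
To make the code-point relationship concrete, here is a minimal Python sketch (the character Ç is just an example borrowed from the comments on the question): decimal, hexadecimal, and named references all resolve to the same code point, which is what `ord` reports.

```python
import html

char = "Ç"
code_point = ord(char)                   # 199, i.e. U+00C7

decimal_ref = f"&#{code_point};"         # "&#199;"
hex_ref = f"&#x{code_point:X};"          # "&#xC7;"
named_ref = "&Ccedil;"                   # named reference defined by HTML

# All three forms resolve to the same character.
resolved = [html.unescape(ref) for ref in (decimal_ref, hex_ref, named_ref)]
```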

Caveat: the following describes my own experience, and yours may vary.

HTML documents transferred by clients for me to work on with symbols directly embedded are very often corrupted and cannot be recovered. This may be a weakness of U.S. infrastructure or a lack of knowledge on the part of my customers about how to send their documents. The infrastructure and people in a country whose primary language relies on non-ASCII characters would be much more likely to support and understand how to properly transfer their documents with no corruption.
If you are developing your own website and uploading the final copies of your own files to your server, then the risk of corruption is very small.
If you do not have control over your document from the point you edit it to the point that it is served to users, then you run the risk (perhaps not today, but certainly within recent years in the U.S., a likelihood more than mere risk) of having the document improperly converted at some point along the way and being permanently corrupted regardless of what encoding you attempt to view it in.

Numerical character references always refer to the UCS code point. As such, they *do* solve encoding compatibility problems. — deceze, May 16 '13 at 20:38
You are thinking of the paint. I am thinking of what is under the paint. Assuming you know the UCS code point, that's true. If you merely convert all the (multi-)byte values to numbers, that's not the UCS code point, but some random value. And the same is true for decoders. I doubt that the majority of browsers have a 1.2 million item database of all the code points when they decode the numeric entities. — Joseph Myers, May 16 '13 at 21:03
There are always compromises in the real world, and that's the world I'm talking about. I'm sure you're going to say nowadays everyone has software to do it properly. Ok, fine. But Stack Overflow is for people who are doing the coding of software, not people who use software to do everything. I will rewrite my answer so that it doesn't say "no compatibility benefit" but rather "is not a panacea for compatibility problems" (or something else). — Joseph Myers, May 16 '13 at 21:06
Sorry, I don't understand what you're going on about. *Assuming* you are using references correctly (by their code point, not random numbers) *and assuming* the browser actually understands numeric references correctly and supports the character in question, *then* numeric references basically get rid of encoding problems since you can keep your file encoded in pure ASCII, which virtually everything understands correctly. That's the *real world* advantage I had in mind. If you're not using references correctly or the client doesn't support the character either way, the whole point is moot. — deceze, May 17 '13 at 07:37
@deceze Thank you. We do agree. The original question asked for reasons why someone might want to use an entity rather than the actual symbol. I am answering the question. These are issues I have nearly every day (the example given in my first answer happened this week with an emailed document that was properly attached, base64 encoded, and yet was corrupted). You are admirably trying to make the world a better place by promoting useful encodings like UTF-8 and eliminating people's fear of using them. Keep on doing so, and hopefully the problems that are here now will eventually go away. — Joseph Myers, May 17 '13 at 18:02
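
The pitfall discussed in these comments, writing byte values instead of code points into numeric references, can be sketched in Python like this (both helper names are hypothetical):

```python
import html

def refs_from_code_points(text: str) -> str:
    """Correct: one reference per character, using its code point."""
    return "".join(f"&#{ord(ch)};" for ch in text)

def refs_from_utf8_bytes(text: str) -> str:
    """Incorrect: one reference per UTF-8 byte; bytes are not code points."""
    return "".join(f"&#{byte};" for byte in text.encode("utf-8"))

good = refs_from_code_points("é")        # "&#233;"
bad = refs_from_utf8_bytes("é")          # "&#195;&#169;"

shown_good = html.unescape(good)         # the intended character
shown_bad = html.unescape(bad)           # two unrelated characters
```

The second form decodes without any error, which is why this kind of corruption tends to go unnoticed until someone reads the page.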

score 0 · Answer 2 · answered May 16 '13 at 18:19


No.

Entities and character references are useful only if:

The character has special meaning in HTML at the point where you want to use the character (/ never will, it only has special meaning in places where you can't have a / as data anyway).
You can't type the character (e.g. because it doesn't appear on your keyboard).
You can't encode the file as UTF-8 (or in another encoding that includes it … and / appears in ASCII).
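
The first condition can be illustrated with Python's standard `html.escape` (a sketch; the sample strings are made up). It replaces only the characters that have special meaning in HTML, and `/` is not one of them:

```python
import html

escaped_tag = html.escape("<br>")        # "&lt;br&gt;"
escaped_amp = html.escape("AT&T")        # "AT&amp;T"
escaped_slash = html.escape("a/b")       # "a/b", left untouched
```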

answered May 16 '13 at 18:19

Quentin


Didn't downvote, but... Because you "can't type the character"? If you can find out its numeric value, you can probably copy and paste it. Charmap etc. is useful too. – deceze May 16 '13 at 18:59

score -3 · Answer 3 · answered May 16 '13 at 18:37


Unless you know for a fact that you will always be using the same software and computer system to edit your HTML, you will inevitably run into situations where you cannot edit your own code if you directly use symbols, regardless of what character encoding you specify in your document or with your HTTP headers. Only in a perfect world does the character encoding always properly transfer, and even then neither Macintosh nor Windows truly does it correctly.

If I open up a supposedly "properly" encoded document from either Macintosh or Windows in software that truly supports all available encoding systems, I see a message like this:

-=-J(DOS)**--F1   Top L3     (Text) ----------------------------------------
These default coding systems were tried to encode text
in the buffer:
  (iso-2022-7bit-dos (284 . 4194194) (379 . 4194194) (462 . 4194195)
  (492 . 4194196) (635 . 4194195) (640 . 4194196) (642 . 4194195) (772
  . 4194196) (833 . 4194195) (839 . 4194196) (857 . 4194195))
  (utf-8-dos (284 . 4194194) (379 . 4194194) (462 . 4194195) (492
  . 4194196) (635 . 4194195) (640 . 4194196) (642 . 4194195) (772
  . 4194196) (833 . 4194195) (839 . 4194196) (857 . 4194195))
However, each of them encountered characters it couldn't encode:
  iso-2022-7bit-dos cannot encode these: \222 \222 \223 \224 \223 \224 \223 \224 \223 \224 ...
  utf-8-dos cannot encode these: \222 \222 \223 \224 \223 \224 \223 \224 \223 \224 ...

Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.

Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
   to remove or modify the problematic characters,
or specify any other coding system (and risk losing
   the problematic characters).

  thai-tis620
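
One possible reading of that message, offered as an assumption rather than a diagnosis: the octal escapes \222, \223 and \224 are the bytes 0x92, 0x93 and 0x94, which are curly quotes in Windows-1252 but are not valid on their own in UTF-8. A Python sketch:

```python
raw = bytes([0o222, 0o223, 0o224])       # the byte values from the message

# In Windows-1252 these bytes are typographic quotation marks.
as_cp1252 = raw.decode("cp1252")

# As UTF-8 they are stray continuation bytes and cannot be decoded.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```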

Remember that as soon as the data is off of your server, e.g., placed in an email, etc., there is no guarantee the encoding is passed along, and chances are that it is not. Byte order marks and other invisible means of identifying documents do not work as promised, let alone transient methods such as HTTP headers which are lost as soon as the document moves beyond the context of your own carefully configured HTTP server.

The guiding principle of HTML is that it is a plain text markup language that, when properly used, is universally compatible with any system supporting the most basic of text. HTML documents should use HTML entities for any characters outside of the normal 7-bit US-ASCII character set. Any other characters have different binary definitions depending on the encoding used and may even vary between single-byte and multi-byte representations.

Within non-HTML documents you can feel free to use raw symbols because when you embed them within either their native file format or within HTML you can ensure that you specify the "right" character encoding, i.e., the one that will be recognized by the system where you authored it and any system compatible with that.

answered May 16 '13 at 18:37

Joseph Myers


Terrible advice. This is easy to say if you're working with English sites, but telling someone to keep all characters in a, say, Japanese document as entities makes it impossible to work with the document. We have transitioned past the era where this was an issue, thankfully! – deceze May 16 '13 at 19:03
@deceze There is no "era" going on here. Japanese is just as much a native language of computers and the internet as English, maybe more so. At least as much as you, I would love the convenience of mixing my language with my HTML, but I have the experience that tells me it is unmaintainable. The mistake in your analogy is that you assume it is still natural to write content directly inside the HTML source code. That era is over. Content and HTML/CSS have now been beautifully separated from each other. Read the end of my answer again, please. – Joseph Myers May 16 '13 at 19:33
What is the "native" format of a Japanese website? Am I *required* to get the content from somewhere else and then programmatically wrap HTML around it? Is that what you're saying? This is impossible. You will still have Japanese characters in *some form of source code*, which means at least *that* file you need to treat properly in the proper encoding. Why not directly in HTML? I have so far never had any such problems with mixed Japanese/HTML in many years. – deceze May 16 '13 at 19:38
My answer is based on experience with many languages, not one language. Compatibility problems with Japanese/HTML do exist, but they don't matter at all in your case because they only exist for people who don't speak Japanese and don't have software that has been set up to access Japanese websites. If you were forced to edit your web pages without software supporting some encoding of Japanese (e.g., UTF-8), you would be unable to do it because the software couldn't even save the document even if you were just trying to save the HTML part of it. – Joseph Myers May 16 '13 at 19:51
@deceze Why the focus again and again on Japanese? You are actually proving my point, not disproving it. If you exist in a one-language world, then you will never have problems, and don't worry about it. HTML is neither English nor Japanese. I am not arguing about languages but about the lack of control that anyone has over the encoding of their document once it is off of their own web server or opened in other editing software. The original question asked for a reason why character references were useful, and I gave an honest answer. – Joseph Myers May 16 '13 at 19:56
I am focusing on Japanese because your advice applies equally to HTML documents containing any human language, and you will hate your life if you ever try to work with an HTML document's source code where every character is a character reference. What is your practical advice for working with HTML source code as an author where a good deal or even the majority of the text content is non-ASCII (assuming you can actually read the language in question)? – deceze May 16 '13 at 19:59
You do have a valid point, and I agree that there is no real way to work with other languages without encoding the whole document in that language. Any compatibility problems won't matter for anyone in that language group. Using a named HTML reference allows the entity to be represented correctly regardless of the encoding. But if an entire document is in another language, you are right that compatibility is irrelevant because it would only be incompatible for people who can't read it anyway. – Joseph Myers May 16 '13 at 20:04
And yes, you *will* need software that actually supports the encoding and language. But that's what I mean by "era": today's tools and editors have largely caught up with the internationalization that's happened and *can* handle encodings correctly. – deceze May 16 '13 at 20:04
Why are you saying "encode the document in that language"? Encodings have stopped being language specific with Unicode, which is widely supported now. UTF-8 both supports virtually all existing major human languages *and* is backwards compatible with ASCII. There's no language dependent encoding problem anymore using it. – deceze May 16 '13 at 20:07
Let me rephrase: I agree that there is no real way to work with other languages without encoding the whole document in an encoding supporting that language (preferably one also supporting other languages, like UTF-8). – Joseph Myers May 16 '13 at 20:11
I can agree with that. Since this equally well applies to documents in "this" language (any ASCII based language), would you agree then that your answer regarding the use of character references doesn't really hold? – deceze May 16 '13 at 20:24
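
The two properties of UTF-8 mentioned in these comments can be checked with a few lines of Python (a sketch; the sample strings are assumptions for illustration): ASCII text has identical bytes in both encodings, and text in several scripts fits in a single document.

```python
ascii_text = "<p>a/b</p>"
# Pure ASCII markup produces the same bytes under ASCII and UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

mixed = "<p>日本語 Ç ภาษาไทย</p>"   # Japanese, Latin and Thai in one string
# One encoding carries all of them and decodes back unchanged.
round_trip = mixed.encode("utf-8").decode("utf-8")
```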
