What's the difference between typing the Encoding of a Unicode character or just copying the character?

Question

For example, if I want the bullet point character in my HTML page, I could either type out `&#8226;` or just copy-paste •. What's the real difference?

The first one is an HTML entity which gets converted to the real character in HTML… Related: [HTML5: which is better - using a character entity vs using a character directly?](https://stackoverflow.com/q/9808098/4642212). — Sebastian Simon, Jul 09 '18 at 01:09
If you insert that in MySQL, sometimes ñ, ◄, •, and other chars become ?? inside the database, so I use `&ntilde;` for ñ. — bdalina, Jul 09 '18 at 01:19
@bdalina That's only if you're Doing It Wrong™, the database doesn't randomly willy-nilly decide to replace characters. — deceze, Jul 09 '18 at 13:18

score 2 · Answer 1 · answered Jul 09 '18 at 13:16

A character reference such as `&#8226;` works independently of the document encoding. It takes up more octets in the source (here: 7).

A character such as • works only with the precise encoding declared with the document. It takes up fewer octets in the source (here: 3, assuming UTF-8).
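The octet counts above can be checked with a short Python sketch. Python is used only for illustration, and the variable names are hypothetical:

```python
# The character reference is seven plain ASCII characters.
reference = "&#8226;"
# The literal bullet is a single code point, U+2022.
bullet = "\u2022"

# Pure ASCII: the same bytes in any ASCII-compatible encoding.
reference_bytes = reference.encode("ascii")
# Needs an encoding that can represent U+2022; UTF-8 is assumed here.
bullet_utf8 = bullet.encode("utf-8")

print(len(reference_bytes))  # 7
print(len(bullet_utf8))      # 3
print(bullet_utf8.hex())     # e280a2
```

The count for the literal character depends on the encoding chosen; the count for the reference does not, as long as the encoding is ASCII-compatible.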

answered Jul 09 '18 at 13:16

daxim


"Works independently of the document encoding": While this is true, processing the document as text without using the precise document encoding it was written with is Doing It Wrong™ (@deceze). – Tom Blodget Jul 09 '18 at 14:26

score 2 · Accepted Answer · edited Jul 09 '18 at 19:35

`&#8226;` is a sequence of 7 ASCII characters: ampersand (&), number sign (#), eight (8), two (2), two (2), six (6), semicolon (;).

• is 1 single bullet point character.

That is the most obvious difference.

The former is not a bullet point. It's a string of characters that an HTML browser would parse to produce the final bullet point that is rendered to the user. You will always be looking at this string of ASCII characters whenever you look at your HTML's source code.

The latter is exactly the bullet point character that you want, and it's clear and precise to understand when you look at it.

Now, `&#8226;` uses only ASCII characters, so the file it is in can be encoded using pure ASCII or any ASCII-compatible encoding. Since ASCII is the de facto basis of virtually all common encodings, you don't need to worry much about the file encoding; you can largely ignore that part of working with text files and will probably never run into any issues.

However, `&#8226;` is only meaningful in HTML. It remains just a string of ASCII characters in the context of a database, a plain-text email, or any other non-HTML situation.
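As a rough illustration of that point, here is a minimal Python sketch. The `stored` value is a made-up example, and `html.unescape` from the standard library stands in for the step an HTML parser would perform:

```python
import html

# Hypothetical value as it might sit in a database column or a plain-text email.
stored = "&#8226; first item"

# Outside HTML nothing interprets the reference: it is seven ordinary characters.
print(stored[:7])   # &#8226;

# Only an HTML-aware step turns it into the actual bullet character.
decoded = html.unescape(stored)
print(decoded)      # • first item
```

Any consumer that does not run such a decoding step will show the raw seven characters to the user.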

•, on the other hand, is not a character that can be encoded in ASCII, so you need to consciously choose an encoding which can represent that character (like UTF-8), and you need to ensure that you're sending the correct metadata to ensure that clients interpret the encoding correctly as well (HTTP headers, HTML <meta> tags, etc). See UTF-8 all the way through.
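Both halves of that requirement can be sketched in Python: the character cannot be encoded as ASCII at all, and bytes written as UTF-8 turn into mojibake when a client assumes a different encoding. Windows-1252 is picked here purely as an example of a wrong guess, and the variable names are hypothetical:

```python
bullet = "\u2022"

# ASCII has no code for U+2022, so encoding fails outright.
try:
    bullet.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_bytes = bullet.encode("utf-8")

# Correct metadata: the bytes are decoded with the encoding they were written in.
right = utf8_bytes.decode("utf-8")
# Wrong metadata: the same bytes read as Windows-1252 (an assumed bad guess).
wrong = utf8_bytes.decode("cp1252")

print(ascii_ok)  # False
print(right)     # •
print(wrong)     # â€¢
```

The bytes themselves never change; only the declared encoding decides whether the reader sees the bullet or three unrelated characters.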

But • means • in any context, plain-text or otherwise, and does not need to be specifically HTML-interpreted.

Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
