We are hosting PHP apps on a Debian-based LAMP installation. Everything is quite OK – performance-, administrative-, and management-wise. However, being somewhat new developers (we're still in high school) we've run into some problems with the character encoding for Western character sets.

After doing a lot of research, I have come to the conclusion that the information online is somewhat confusing. Some of it describes Windows-1252 as being ANSI and fully compatible with ISO-8859-1.

So anyway, what is the difference between Windows-1252(1/3/4) and ISO-8859-1?
And where does ANSI come into this, anyway?

What encoding should we use on our Debian servers (and workstations) in order to ensure that clients get all information in the intended way, and that we don't lose any characters on the way?

Henke

5 Answers

I'd like to answer this in a more web-like manner, and to do so we need a little history. Joel Spolsky has written a very good introductory article on the absolute minimum every dev should know about Unicode and character encoding. Bear with me here because this is going to be a somewhat looong answer. :)

For the history, I'll quote a few passages from it: (Thank you very much Joel! :) )

The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes.

And all was good, assuming you were an English speaker. Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.

So now "OEM character sets" were distributed with PCs, and these were still all different and incompatible. And to our contemporary amazement, it was all fine! They didn't have the Internet back then, and people rarely exchanged files between systems with different locales.

Joel goes on to say:

In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.

And this is how the "Windows code pages" were eventually born. They were actually "parented" by the DOS code pages. And then Unicode was born! :) UTF-8 is "another system for storing your string of Unicode code points", and in it "every code point from 0-127 is stored in a single byte", which makes it the same as ASCII in that range. I will not go into any more specifics of Unicode and UTF-8, but you should read up on the BOM, endianness and character encoding in general.

On "the ANSI conspiracy", Microsoft actually admits the mislabeling of Windows-1252 in a glossary of terms:

The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is NOT identical with ISO 8859-1. The Windows character set is often called "ANSI character set", but this is SERIOUSLY MISLEADING. It has NOT been approved by ANSI.

So "ANSI", when referring to Windows character sets, is not ANSI-certified! :)

As Jukka pointed out (credits go to you for the nice answer):

Windows-1252 differs from ISO Latin 1, also known as ISO-8859-1 as a character encoding, so that the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (so-called C1 Controls), whereas in Windows-1252, some of the codes there are assigned to printable characters (mostly punctuation characters), others are left undefined.

However my personal opinion and technical understanding is that both Windows-1252 and ISO-8859-1 ARE NOT WEB ENCODINGS! :) So:

  • For web pages please use UTF-8 as the encoding for the content. So store data as UTF-8 and "spit it out" with the HTTP header: Content-Type: text/html; charset=utf-8.

    There is also a thing called the HTML content-type meta tag: <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> Now, what browsers actually do when they encounter this tag is start again from the beginning of the HTML document so that they can reinterpret it in the declared encoding. This should happen only if the 'Content-Type' header does not already declare a charset.

  • Use other specific encodings if the users of your system need files generated from it. For example, some Western users may need Excel-generated files, or CSVs in Windows-1252. If this is the case, encode the text in that encoding, then store it on the file system and serve it as a downloadable file.

  • There is another thing to be aware of in the design of HTTP: character-encoding negotiation is supposed to work like this.

    I. The client requests a web page in specific content types and encodings via the 'Accept' and 'Accept-Charset' request headers.

    II. Then the server (or web application) returns the content transcoded to that encoding and character set.

This is NOT THE CASE in most modern web apps. What actually happens is that web applications serve content as UTF-8 (forcing it on the client). And this works because browsers interpret received documents based on the response headers and not on what they actually expected.
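As a minimal sketch of the "store as UTF-8 and declare it in the header" advice, the following Python snippet builds the body bytes and the matching headers for a page. The function name `build_html_response` is hypothetical, and a real application would hand these values to its web framework rather than to a plain dict:

```python
def build_html_response(text: str) -> tuple[dict[str, str], bytes]:
    """Return (headers, body) for an HTML page served as UTF-8.

    Hypothetical helper for illustration; not part of any framework.
    """
    # Encode once, at the edge of the application.
    body = text.encode("utf-8")
    headers = {
        # The charset parameter tells the browser how to decode the bytes.
        "Content-Type": "text/html; charset=utf-8",
        # Content-Length counts bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_html_response("<p>Smörgåsbord costs 5 €</p>")
print(headers["Content-Type"])
print(len(body), "bytes")
```

Note that the byte length is larger than the character count here, because every non-ASCII character takes more than one byte in UTF-8.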

We should all go Unicode, so please, please, please use UTF-8 to distribute your content wherever possible and most of all applicable. Or else the elders of the Internet will haunt you! :)
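For the exception mentioned in the list above (a downloadable CSV that has to be in Windows-1252), a sketch could look like this. The helper name `csv_bytes_cp1252` is made up, and replacing unencodable characters with `?` is only one possible policy, chosen here as an assumption:

```python
import csv
import io


def csv_bytes_cp1252(rows) -> bytes:
    """Serialize rows to CSV text, then encode as Windows-1252.

    Hypothetical helper; characters that Windows-1252 cannot
    represent are replaced with '?' (an assumed policy).
    """
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("cp1252", errors="replace")


data = csv_bytes_cp1252([["Name", "Price"], ["Müller", "5 €"]])
print(data)
```

Inside the application the text stays Unicode; the conversion to Windows-1252 happens only when the file is produced.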

P.S. Some more nice articles on using MS Windows characters in Web Pages can be found here and here.

Mark Amery
Borislav Sabev
  • Thanks for such a great answer. Using your links I've created this array of all possible encoding strings, not sure if useful but sharing the link for all random googlers like myself: https://gist.github.com/liesislukas/d7c4bcd0e8b83aef084d8d269fbd7ba7 – Lukas Liesis May 14 '17 at 20:17
  • @LukasLiesis nice to know it helped – Borislav Sabev May 17 '17 at 10:11

The most authoritative reference to meanings of character encoding names is the IANA registry Character Sets.

Windows-1252 is commonly known as Windows Latin 1 or as Windows West European or something like that. It differs from ISO Latin 1, also known as ISO-8859-1 as a character encoding, so that the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (so-called C1 Controls), whereas in Windows-1252, some of the codes there are assigned to printable characters (mostly punctuation characters), others are left undefined.

ANSI comes here as a misnomer. Microsoft once submitted Windows-1252 to the American National Standards Institute (ANSI) to be adopted as a standard; the proposal was rejected, but Microsoft still calls their code “ANSI”. For further confusion, they may use “ANSI” for different encodings (basically, the “native 8-bit encoding” of a Windows installation).

In the web context, declaring ISO-8859-1 will be taken as if you declared Windows-1252. The reason is that C1 Controls are not used, or useful, on the web, whereas the added characters are often used, even on pages mislabelled as ISO-8859-1. So in practical terms it does not matter which one you declare.
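To illustrate the relabelling described above, here is a small Python sketch. The `WEB_LABELS` table is a hand-written assumption covering only a few labels, not the full table browsers use, and `decode_like_a_browser` is a hypothetical name. Python's own `cp1252` codec also rejects the five unassigned bytes, which browsers handle more leniently, so this is an approximation:

```python
# Assumed, partial label table: on the web, an ISO-8859-1 label is
# treated as if it said Windows-1252.
WEB_LABELS = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
    "windows-1252": "cp1252",
    "utf-8": "utf-8",
}


def decode_like_a_browser(data: bytes, declared: str) -> str:
    """Decode bytes using the web interpretation of a charset label."""
    codec = WEB_LABELS[declared.strip().lower()]
    # Unassigned bytes become U+FFFD instead of raising an error.
    return data.decode(codec, errors="replace")


smart_quotes = b"\x93quoted\x94"
print(decode_like_a_browser(smart_quotes, "ISO-8859-1"))  # curly quotes
print(repr(smart_quotes.decode("latin-1")))               # C1 control characters
```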

There might still be some browsers that actually interpret data as ISO-8859-1 if declared so, but they must be very rare (the last I remember seeing was a version of Opera about ten years ago).

You do not describe what problems you have encountered. The most common cause of problems seems to be that data is actually UTF-8 encoded but declared as ISO-8859-1 (or Windows-1252), or vice versa. This becomes a real problem to web page authors if a server forces a Content-Type header declaring a character encoding and it is one that they cannot deal with in their authoring environment (or don’t know how to do that).
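The two mismatches described above are easy to reproduce. The following Python sketch decodes the same word with the wrong declaration in each direction; the sample word is arbitrary:

```python
text = "café"

# Case 1: data is UTF-8 but is read as Windows-1252.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print(misread)  # cafÃ©

# Case 2: data is ISO-8859-1 but is read as UTF-8.
latin1_bytes = text.encode("latin-1")
broken = latin1_bytes.decode("utf-8", errors="replace")
print(broken)  # caf followed by U+FFFD, the replacement character

# Case 1 is reversible as long as nothing was lost along the way.
print(misread.encode("cp1252").decode("utf-8"))  # café
```

The first case produces the familiar two-character garbage for each accented letter; the second case usually loses the character entirely.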

Mark Amery
Jukka K. Korpela

This table gives an overview of the differences. It shows all characters which are defined in the range 0x80…0x9F of Windows-1252, a range in which ISO-8859-1 and ISO-8859-15 define no printable characters:

        │  …0  │  …1  │  …2  │  …3  │  …4  │  …5  │  …6  │  …7  │  …8  │  …9  │  …A  │  …B  │  …C  │  …D  │  …E  │  …F  │
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     8… │   €  │      │   ‚  │   ƒ  │   „  │   …  │   †  │   ‡  │   ˆ  │   ‰  │   Š  │   ‹  │   Œ  │      │   Ž  │      │
Unicode │ 20AC │      │ 201A │ 0192 │ 201E │ 2026 │ 2020 │ 2021 │ 02C6 │ 2030 │ 0160 │ 2039 │ 0152 │      │ 017D │      │
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     9… │      │   ‘  │   ’  │   “  │   ”  │   •  │   –  │   —  │   ˜  │   ™  │   š  │   ›  │   œ  │      │   ž  │   Ÿ  │
Unicode │      │ 2018 │ 2019 │ 201C │ 201D │ 2022 │ 2013 │ 2014 │ 02DC │ 2122 │ 0161 │ 203A │ 0153 │      │ 017E │ 0178 │

Unlike in Windows-1252, the range 0x80…0x9F is used for control codes in ISO-8859-1.
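The table can be cross-checked with Python's built-in codecs. This sketch walks the range 0x80…0x9F, records what `cp1252` assigns to each byte, and checks that `latin-1` maps every one of them to a control character:

```python
import unicodedata

printable = {}   # byte value -> character assigned in Windows-1252
undefined = []   # byte values Windows-1252 leaves unassigned

for byte in range(0x80, 0xA0):
    try:
        printable[byte] = bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(byte)

print(len(printable), "printable characters")
print("unassigned:", [hex(b) for b in undefined])

# In ISO-8859-1 the same 32 bytes are all C1 control characters.
all_controls = all(
    unicodedata.category(bytes([b]).decode("latin-1")) == "Cc"
    for b in range(0x80, 0xA0)
)
print("all control codes in ISO-8859-1:", all_controls)
```

The counts match the table: 27 assigned positions and 5 empty cells.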

This table shows the differences between Windows-1252, ISO-8859-1 and ISO-8859-15:

Character    │    € │   Š │   š │   Ž │   ž │   Œ │   œ │   Ÿ │  ¤ │  ¦ │  ¨ │  ´ │  ¸ │  ¼ │  ½ │  ¾ │
───────────────────────────────────────────────────────────────────────────────────────────────────────
ISO 8859-1   │    – │   – │   – │   – │   – │   – │   – │   – │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │
ISO 8859-15  │   A4 │  A6 │  A8 │  B4 │  B8 │  BC │  BD │  BE │  – │  – │  – │  – │  – │  – │  – │  – │
Windows-1252 │   80 │  8A │  9A │  8E │  9E │  8C │  9C │  9F │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │
Unicode      │ 20AC │ 160 │ 161 │ 17D │ 17E │ 152 │ 153 │ 178 │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │
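A few cells of this table can be checked in Python. In the sketch below, `encode_or_none` is a hypothetical helper that returns `None` where the table shows a dash, i.e. where the encoding has no code for the character:

```python
def encode_or_none(char: str, codec: str):
    """Return the encoded byte(s), or None if the codec lacks the character."""
    try:
        return char.encode(codec)
    except UnicodeEncodeError:
        return None


for char in ("€", "Š", "¤"):
    row = {
        codec: encode_or_none(char, codec)
        for codec in ("latin-1", "iso8859_15", "cp1252")
    }
    print(char, row)
```

The euro sign, for instance, has no code in ISO-8859-1, sits at A4 in ISO-8859-15 and at 80 in Windows-1252, while the currency sign ¤ was dropped from ISO-8859-15 to make room for it.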
Wernfried Domscheit

On Windows systems in countries with an English/Latin alphabet, e.g. UK/US/France/Germany and others, "ANSI" refers to the Windows-1252 encoding. https://web.archive.org/web/20170916200715/http://www.microsoft.com:80/resources/msdn/goglobal/default.mspx

Windows-1252 and ISO-8859-1 are very similar. They differ only in the 32 code positions from 128 to 159.

In Windows-1252, the characters from 128 to 159 are used for some useful characters such as the Euro symbol.

In ISO-8859-1 these characters are mapped to control characters which are useless in HTML.

So a suggestion is to see whether 128 is the euro symbol. If it is, it's Windows-1252.

The codes from 128 to 159 are not in use in ISO-8859-1, but many browsers will display the characters from the Windows-1252 character set instead of nothing.
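The suggestion above can be turned into a rough heuristic. This Python sketch assumes the data is already known to be in one of the two single-byte encodings (it says nothing useful about UTF-8, where bytes in this range occur as parts of multi-byte sequences); the function name is hypothetical:

```python
def has_c1_range_bytes(data: bytes) -> bool:
    """True if any byte falls in 128..159 (0x80..0x9F).

    Assumption: data is either ISO-8859-1 or Windows-1252. Bytes in
    this range are then far more likely to be Windows-1252 punctuation
    than ISO-8859-1 control codes.
    """
    return any(0x80 <= byte <= 0x9F for byte in data)


print(has_c1_range_bytes(b"Price: 5 \x80"))          # True
print(has_c1_range_bytes("café".encode("latin-1")))  # False
```

A `False` result does not distinguish the two encodings, since they agree on every other byte.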

These 2 links list them both.

http://www.w3schools.com/charsets/ref_html_ansi.asp

http://www.w3schools.com/charsets/ref_html_8859.asp

Some comments were very useful and I amended my post accordingly based on them.

Chenfeng points out that on Windows, "ANSI" refers to the system codepage specified by the locale, whatever that is (Arabic/Chinese/Cyrillic/Vietnamese/...). It does not [necessarily] refer to Windows-1252. You can test this by changing your locale and then using notepad.exe to save a text file in "ANSI". According to this MS documentation, there are 14 different "ANSI" code pages: https://learn.microsoft.com/en-us/windows/desktop/intl/code-page-identifiers

Wernfried points out https://web.archive.org/web/20170916200715/http://www.microsoft.com:80/resources/msdn/goglobal/default.mspx and that USA codepage 437 is the 'OEM codepage' (see the OEM column), and the OEM codepage is the one used by the cmd prompt. He also points out, showing from that webpage, that in many countries that do not use the English/Latin alphabet, ANSI is not Windows-1252. I notice that, for example, Hebrew ANSI uses 1255 (the Hebrew OEM codepage is 862).
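To make the ANSI/OEM distinction concrete, this Python sketch encodes one character with both code pages of a region. The `CODE_PAGES` table only contains the two pairs named in this answer, and the region keys are labels invented for the example:

```python
# (ANSI code page, OEM code page) pairs mentioned in the text.
CODE_PAGES = {
    "US": ("cp1252", "cp437"),
    "Hebrew": ("cp1255", "cp862"),
}


def ansi_and_oem_bytes(char: str, region: str) -> tuple[bytes, bytes]:
    """Encode one character with the region's ANSI and OEM code pages."""
    ansi, oem = CODE_PAGES[region]
    return char.encode(ansi), char.encode(oem)


print(ansi_and_oem_bytes("é", "US"))            # (b'\xe9', b'\x82')
print(ansi_and_oem_bytes("\u05d0", "Hebrew"))   # (b'\xe0', b'\x80'), the letter alef
```

The same character gets a different byte in the ANSI and OEM code page of the same machine, which is why text written in Notepad can look wrong in the command prompt.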

barlop
  • I think **"ANSI is also called Windows-1252"** is only valid on a "western" PC. In other regions "ANSI" might be something different, see [National Language Support (NLS) API Reference](https://web.archive.org/web/20171015144140/https://www.microsoft.com/resources/msdn/goglobal/default.mspx) – Wernfried Domscheit Feb 22 '18 at 09:37
  • @WernfriedDomscheit hmm.. west europe is kind of western, and outside europe(geographically outside of it, though part of it politically from 1973-recent), but still western, they have codepage 850 as opposed to usa codepage 437. I'll look into the idea that there are different character sets referred to as ANSI. – barlop Feb 22 '18 at 10:08
  • Also, Apparently there are also significant differences I didn't mention between ISO 8859-1 and Windows-1252 https://en.wikipedia.org/wiki/Windows-1252 "It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1. A common result was that all the quotes and apostrophes (produced by "smart quotes" in word-processing software) were replaced with question marks or boxes on non-Windows operating systems, making text difficult to read" – barlop Feb 22 '18 at 10:08
  • You missed column "ANSI codepage" with "OEM codepage". For most countries/regions the ANSI codepage is 1252, however there are some others. "OEM" is the default code page when you launch `cmd.exe`. – Wernfried Domscheit Feb 22 '18 at 10:11
  • On Windows, "ANSI" refers to the system codepage specified by the locale, whatever that is (Arabic/Chinese/Cyrillic/Vietnamese/...). It does not refer to Windows-1252. You can test this by changing your locale and then using notepad.exe to save a text file in "ANSI". According to [this MS documentation](https://learn.microsoft.com/en-us/windows/desktop/intl/code-page-identifiers), there are 14 different "ANSI" code pages. – Chenfeng Oct 31 '18 at 20:34

What is the exact difference between Windows-1252 and ISO-8859-1?

– Compare the character sets of Windows-1252 (CP-1252) and ISO-8859-1.
If you inspect the charts, you'll notice that Windows-1252 has 27 characters that are not defined in ISO-8859-1. There is no other difference.

ISO-8859-1 and Windows-1252 (CP-1252) compared

Here is the same information in just one chart.

Windows-1252, the difference to ISO-8859-1 in red.

Answers to your other questions

What is the difference between Windows-1252(1/3/4) and ISO-8859-1?

– I've already explained the difference between Windows-1252 and ISO-8859-1.
The difference between Windows-1252 and, for example, Windows-1251 is that Windows-1251 has characters in the Cyrillic alphabet that are completely missing in Windows-1252. Similarly, Windows-1253 includes the Greek alphabet and Windows-1254 the Turkish alphabet. For other languages, covering all ten Windows code pages, see the table I've included at the end of this answer.
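A quick way to see this is to decode one and the same byte with several Windows code pages. In this Python sketch the byte 0xF0 is an arbitrary choice and `decode_in_code_pages` is a hypothetical helper:

```python
WINDOWS_CODE_PAGES = ("cp1252", "cp1251", "cp1253", "cp1254")


def decode_in_code_pages(data: bytes) -> dict[str, str]:
    """Decode the same bytes with each of the listed Windows code pages."""
    return {codec: data.decode(codec) for codec in WINDOWS_CODE_PAGES}


for codec, char in decode_in_code_pages(b"\xf0").items():
    # Print the code point so the result is unambiguous in any terminal.
    print(codec, f"U+{ord(char):04X}", char)
```

The one byte comes out as a Latin, a Cyrillic, a Greek and a Turkish letter respectively, which is why a file is unreadable without knowing which code page it was written in.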

Where does ANSI come into this, anyway?

– Microsoft uses ANSI as an umbrella term for its ten Windows code pages.
Microsoft uses this convention in its text editor notepad.exe in all versions of Windows,
typically located at C:\WINDOWS\System32.
Other text editors, such as Notepad2 and Notepad++ have also adopted this convention.

What encoding should we use on our Debian servers?

– You should definitely use UTF-8. See for example Character encoding | MDN.

Henke