What is the difference between UTF-8 and ISO-8859-1?

Question

score 390 · Answer 1 · answered Aug 13 '11 at 05:26

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.

Ignacio Vazquez-Abrams

21

One thing to note is that ASCII extends from 0 to 127 only. The MSB is always 0. – Hritik Jan 27 '18 at 12:03
3

When code points above 127 are defined, the encoding system is a version of Extended ASCII. – Rohan Bhale Aug 01 '19 at 08:50
6

@RohanBhale Don't use the phrase Extended ASCII; it'll only cause confusion. – Mr Lister Mar 19 '20 at 16:18
1

But extended ascii might be the correct term. I read it on multiple resources – Rohan Bhale Mar 20 '20 at 12:24
I always heard it as *High ASCII*. – Mar 16 '22 at 22:19
In over 30 years of MS-DOS, Windows, *nix, and the internet I've never heard "high" ASCII mentioned. It's always been "Extended ASCII" – StingyJack Apr 22 '23 at 02:03

StaxMan · Answer 2 · 2011-08-13T19:52:19.440

158

Wikipedia explains both reasonably well: UTF-8 vs Latin-1 (ISO-8859-1). The former is a variable-length encoding, the latter a single-byte, fixed-length encoding. Latin-1 encodes just the first 256 code points of the Unicode character set, whereas UTF-8 can be used to encode all code points. At the physical encoding level, only code points 0-127 are encoded identically; code points 128-255 differ, becoming 2-byte sequences in UTF-8 whereas they are single bytes in Latin-1.
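
The split described above can be checked with Python's built-in codecs. This is only a minimal sketch; the list names are made up for illustration.

```python
# Encode every code point from 0 to 255 with both encodings and compare.
identical = []   # code points whose UTF-8 and Latin-1 bytes match
two_byte = []    # code points that need two bytes in UTF-8

for cp in range(256):
    ch = chr(cp)
    utf8 = ch.encode("utf-8")
    latin1 = ch.encode("iso-8859-1")
    assert len(latin1) == 1  # Latin-1 always uses one byte per character
    if utf8 == latin1:
        identical.append(cp)
    elif len(utf8) == 2:
        two_byte.append(cp)

print(identical[0], identical[-1])  # 0 127
print(two_byte[0], two_byte[-1])    # 128 255
```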

edited Aug 13 '11 at 19:52

answered Aug 13 '11 at 05:30

StaxMan

@mu maybe my statement was ambiguous, but it is not incorrect -- I was not talking about encoded byte sequences, but rather character sets being encoded; meaning that ISO-8859-1 is used to encode first 256 code points of the Unicode character set. – StaxMan Aug 13 '11 at 19:50
1

Your clarification works for me and "ambiguous" would have been a better word choice than "incorrect". – mu is too short Aug 14 '11 at 00:50

Sammitch · Answer 3 · 2022-01-25T23:12:06.690

UTF

UTF is a family of encoding schemes for Unicode code points. The original UTF-8 design allowed sequences of up to 6 bytes, enough for 2^31 [roughly 2 billion] values. Modern UTF-8 uses between 1 and 4 bytes, which leaves room for 2^21 [roughly 2 million] values, although Unicode itself only defines code points up to U+10FFFF (1,114,112 in total).

Long story short: any character with a code point/ordinal representation of 127 or below, aka 7-bit-safe ASCII, is represented by the same 1-byte sequence as in most other single-byte encodings. Any character with a code point above 127 is represented by a sequence of two or more bytes, with the particulars of the encoding best explained here.
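
To make the 1-to-4-byte range concrete, here is a small sketch that encodes one sample character from each length class. The sample characters are an arbitrary choice for illustration.

```python
# One sample character per UTF-8 length class (chosen arbitrarily).
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```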

ISO-8859

ISO-8859 is a family of single-byte encoding schemes used to represent alphabets that can be represented within the range of 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of these likely being ISO-8859-1 aka 'Latin-1'. As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of the encoding family used.

The drawback to this encoding scheme is its inability to accommodate languages that need more than 128 symbols, or to safely display more than one family of symbols at one time. ISO-8859 encodings have also fallen out of favor with the rise of UTF: the ISO working group in charge of them disbanded in 2004, leaving maintenance up to its parent subcommittee.
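
The "one family of symbols at a time" limitation can be illustrated by decoding the same byte under different parts of the standard. This is a rough sketch; the byte 0xE4 and the three parts were picked arbitrarily.

```python
# The same high byte maps to a different character in each ISO-8859 part.
raw = b"\xe4"
parts = ("iso-8859-1", "iso-8859-5", "iso-8859-7")  # Latin-1, Cyrillic, Greek

decoded = {name: raw.decode(name) for name in parts}
for name, ch in decoded.items():
    print(f"{name}: U+{ord(ch):04X} {ch}")
```

Because the byte stream carries no marker saying which part was used, a reader has to know the intended part out of band.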

Windows Code Pages

It's worth mentioning that Microsoft also maintains a set of character encodings with limited compatibility with ISO-8859, usually denoted as "cp####". Microsoft seems to be pushing its recent product releases toward Unicode in one form or another, but for legacy and/or interoperability reasons you're still likely to run into them.

For example, cp1252 is a superset of the printable characters of ISO-8859-1, containing additional printable characters in the 0x80-0x9F range, notably the Euro symbol € and the much maligned "smart quotes" “”. This frequently leads to a mismatch where 8859-1 can be displayed as 1252 perfectly fine, and 1252 may seem to display fine as 8859-1, but will misbehave when one of those extra symbols shows up.
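
A short sketch of that mismatch, using Python's `cp1252` and `iso-8859-1` codecs. The sample strings are invented for illustration.

```python
# Text containing smart quotes and a euro sign, saved as cp1252.
text = "\u201cprice: 5 \u20ac\u201d"
cp1252_bytes = text.encode("cp1252")

# Reading it back as ISO-8859-1 raises no error, so the mismatch is silent:
# the extra symbols turn into invisible C1 control characters.
misread = cp1252_bytes.decode("iso-8859-1")
controls = [f"U+{ord(c):04X}" for c in misread if 0x80 <= ord(c) <= 0x9F]
print(controls)

# Text that stays within printable Latin-1 survives either way.
plain = "caf\u00e9"
print(plain.encode("iso-8859-1").decode("cp1252") == plain)  # True
```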

Aside from cp1252, the Turkish cp1254 is a similar superset of ISO-8859-9, but all other Windows Code Pages have at least some fundamental conflicts, if not differing entirely from their 8859 equivalent.

+1 for answering the question but going beyond and offering info about related encodings. Re: code points for UTF-8, according to https://stackoverflow.com/a/38488358/3353984, UTF-8 supports 2^21 code points. Is that an error, or might a fix be needed here? – Tom Loredo, Dec 17 '18 at 00:27
Unicode is actually 17 planes of 2^16 code points, 0x00_0000 to 0x10_FFFF. The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment, about 1 million. See [How many characters can UTF-8 encode?](https://stackoverflow.com/a/45042566/5535245). – georgeawg, Dec 11 '19 at 22:12

score 37 · Answer 4 · edited Nov 10 '18 at 10:46

ASCII: 7 bits. 128 code points.
ISO-8859-1: 8 bits. 256 code points.
UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.

Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:

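#!/usr/bin/env python3

# U+00A9 COPYRIGHT SIGN lies in the 128-255 range, where the two encodings differ.
c = chr(0xa9)
print(c)
print(c.encode('utf-8'))       # two bytes in UTF-8
print(c.encode('iso-8859-1'))  # one byte in ISO-8859-1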

Output:

©
b'\xc2\xa9'
b'\xa9'
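
As a follow-up sketch (not part of the original answer), the incompatibility also shows up when the bytes are decoded with the wrong codec:

```python
utf8_bytes = "\u00a9".encode("utf-8")          # b'\xc2\xa9'
latin1_bytes = "\u00a9".encode("iso-8859-1")   # b'\xa9'

# UTF-8 bytes read as ISO-8859-1: no error, but each byte becomes a character.
mojibake = utf8_bytes.decode("iso-8859-1")
print(len(mojibake), mojibake)

# ISO-8859-1 bytes read as UTF-8: a lone 0xA9 is not a valid sequence.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("not valid UTF-8:", exc.reason)
```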

edited Nov 10 '18 at 10:46

Damian Vogel

answered Oct 28 '18 at 23:04

Cyker

score 27 · Answer 5 · answered Jun 03 '16 at 19:31

ISO-8859-1 is a legacy standard from back in the 1980s. It can only represent 256 characters, so it is only suitable for some Western languages. Even for many supported languages, some characters are missing. If you create a text file in this encoding and try to copy/paste some Chinese characters into it, you will see weird results. So in other words, don't use it. Unicode has taken over the world, and UTF-8 is pretty much the standard these days unless you have legacy reasons (like HTTP headers, which need to be compatible with everything).
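
What the "weird results" look like depends on the program. As one illustration, this is how Python's codecs react when Chinese characters meet ISO-8859-1. The sample string is made up, and other tools may substitute different placeholder characters.

```python
text = "caf\u00e9 \u4e2d\u6587"  # Latin-1 text followed by two Chinese characters

# Strict encoding refuses: the Chinese characters have no Latin-1 byte.
try:
    text.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lossy fallback silently replaces them with question marks.
lossy = text.encode("iso-8859-1", errors="replace")
print(lossy)

# UTF-8 stores the whole string and round-trips it unchanged.
utf8 = text.encode("utf-8")
print(utf8.decode("utf-8") == text)  # True
```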

Shital Shah

1

I had seen cases where umlauts are supposedly not converted with UTF-8. We saw examples of this, and in searching we found ISO-8859-1, which seems to work. We have a lot of German scientists we work with. – Aggie Jon of 87 Jul 25 '18 at 15:20
5

Umlauts are represented as two bytes in UTF-8. They convert fine and work well. The problem comes from programs that expect 1 byte per character. For these legacy programs, ISO-8859-1 has 1-byte umlauts. – Erik Aronesty Sep 13 '18 at 16:39
2

"So in other words, don't use it." I wouldn't say so, because there are use cases where ISO-8859-1 suits much better than UTF-8: a single byte and 256 chars can be sufficient, resulting in faster processing and less payload. – AndreasRu Apr 11 '21 at 12:38
Just as an example of where single byte encoding is preferred, SMS messages have a limit of 140 bytes and primarily use single-byte encoding. If you were a business that sends automated SMS messages, you don't want to double your cost just to not use a legacy standard. – Caleb McNevin Jun 18 '21 at 19:05

score 4 · Answer 6 · edited Apr 09 '23 at 19:49

One more important thing to realise: if you see iso-8859-1, it probably refers to Windows-1252 rather than ISO/IEC 8859-1. They differ in the range 0x80–0x9F, where ISO 8859-1 has the C1 control codes, and Windows-1252 has useful visible characters instead.

For example, ISO 8859-1 has 0x85 as a control character (in Unicode, U+0085 NEXT LINE), while Windows-1252 has a horizontal ellipsis (in Unicode, U+2026 HORIZONTAL ELLIPSIS, …).

The WHATWG Encoding spec (as used by HTML) expressly declares iso-8859-1 to be a label for windows-1252, and web browsers do not support ISO 8859-1 in any way: the HTML spec says that all encodings in the Encoding spec must be supported, and no more.

Also of interest, HTML numeric character references essentially use Windows-1252 for values in the 0x80–0x9F range rather than Unicode code points; per https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state, a reference such as `&#x85;` will produce U+2026 rather than U+0085.
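
This remapping can be observed outside a browser. The sketch below assumes Python's `html.unescape`, which is documented as following the HTML5 rules for character references, is an acceptable stand-in for a browser's parser. Unlike the WHATWG label, Python's `iso-8859-1` codec implements the ISO standard itself, which makes the contrast visible.

```python
import html

# The numeric character reference for 0x85 resolves via the Windows-1252 table.
resolved = html.unescape("&#x85;")
print(f"U+{ord(resolved):04X}")  # U+2026

# Decoding the raw byte shows the two interpretations side by side.
raw = b"\x85"
as_latin1 = raw.decode("iso-8859-1")  # C1 control character U+0085
as_cp1252 = raw.decode("cp1252")      # horizontal ellipsis U+2026
print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_cp1252):04X}")
```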

score 3 · Answer 7 · answered Apr 15 '18 at 05:49

From another perspective, files that fail to decode as both UTF-8 and ASCII because they contain a byte such as 0xC0 seem to get read properly as ISO-8859-1. The caveat, of course, is that the file shouldn't contain any characters outside the Latin-1 range.
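
A minimal sketch of that situation, with an invented byte string standing in for the file contents:

```python
# Hypothetical file contents: 0xC0 is "A with grave" in Latin-1,
# but it can never appear in valid UTF-8 and is outside ASCII.
data = b"\xc0 la carte"

results = {}
for codec in ("ascii", "utf-8", "iso-8859-1"):
    try:
        results[codec] = data.decode(codec)
    except UnicodeDecodeError:
        results[codec] = None  # this codec rejects the bytes

print(results)
```

Note that a successful ISO-8859-1 decode does not prove the file really is Latin-1: Python's codec accepts any byte sequence, so text in another encoding decodes without error but comes out garbled.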

Nikhil VJ

score 0 · Answer 8 · answered Sep 02 '16 at 14:20

My reason for researching this question was to find out in what way they are compatible. The Latin1 charset (iso-8859-1) is 100% compatible to be stored in a utf8 datastore, in the sense that every Latin1 character has a UTF-8 representation. All ascii chars will be stored as single bytes; the extended-ascii chars (128-255) take two bytes each.

Going the other way, from utf8 to the Latin1 charset, may or may not work. If there are any chars beyond extended-ascii 255 (code points above U+00FF), they cannot be stored in a Latin1 datastore.
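
Both directions can be tried out with Python's codecs. This is only a sketch of the character-level conversion, not of any particular datastore.

```python
# Latin-1 -> UTF-8 -> Latin-1 is lossless for all 256 byte values.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")
as_utf8 = text.encode("utf-8")
round_trip = as_utf8.decode("utf-8").encode("iso-8859-1")

print(round_trip == all_bytes)       # True
print(len(all_bytes), len(as_utf8))  # 256 384 (bytes 128-255 double in size)

# The other direction fails as soon as a character lies beyond U+00FF.
try:
    "\u20ac".encode("iso-8859-1")    # euro sign, U+20AC
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
print(euro_fits)                     # False
```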

Alan Jurgensen

2

Helpful, but I think you meant 127 instead of 255 in extended-ascii 255? – Mar 19 '17 at 16:36
24

Latin-1, or iso-8859-1 is not 100% compatible to be stored in utf8. Any Latin-n or iso-8859-n character above 127 will not be translated to a single byte utf-8 character. However, for values 1-127, they will translate exactly. – Marlin Pierce Nov 28 '17 at 18:22
6

This answer is a bit confusing in its use of the term "extended ascii", which just is a term to refer to any character encoding that is not ASCII. UTF-8 and latin-1 are examples of extended-ASCII encodings. But, non-ascii latin-1 characters (ie. code points above 127) cannot be encoded as a single byte in UTF-8. – rdb Apr 18 '18 at 11:26
In UTF-8 2 byte encodings begin at 128. However there are matching characters in both, so it is possible to go: ISO 8859-1 -> UTF-8 -> ISO 8859-1 losslessly but if there are any characters in a UTF-8 document greater than 255 then it cannot be converted losslessly. – silicontrip Oct 23 '20 at 21:54
