Are there examples of ISO 8859-1 text files which are valid, but different in UTF-8?

Question

I know that UTF-8 supports way more characters than Latin-1 (even with the extensions). But are there examples of files that are valid in both, but where the characters are different? In other words, does the content change depending on which encoding you assume for the file?

I also know that a big chunk of Latin-1 maps 1:1 to the same part in UTF-8. The question is: which code points could change the value if interpreted differently (not invalid, but different)?

Does this answer your question? [Is ISO-8859-1 a Unicode charset?](https://stackoverflow.com/questions/12794825/is-iso-8859-1-a-unicode-charset) — Quentin, Dec 08 '21 at 10:17
Re edit: The accepted answer to the duplicate question covers that. — Quentin, Dec 08 '21 at 10:23
@Quentin I don't see how the accepted answer covers it. Are there now characters in Latin-1 / extensions that can get confused? — Martin Thoma, Dec 08 '21 at 10:25
If you interpret any UTF-8 file that uses any non-ASCII characters as Latin-1, you'll get a whole lot of "weird" characters, yet it's "valid Latin-1"… Is that what you're asking?! — deceze, Dec 08 '21 at 10:31

deceze · Accepted Answer · 2021-12-08T10:45:44.807

score 4

Latin-1 is a single-byte encoding (meaning 1 character = 1 byte) in which, as commonly implemented, every possible byte value maps to something. So literally any file is "valid" in Latin-1: you can interpret any file as Latin-1 and you'll get… something… as a result.
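As a rough check of that claim, here is a minimal Python sketch. It assumes the standard library's `latin-1` codec, which maps each byte value 0x00 to 0xFF to the code point with the same number; other decoders may treat the control ranges differently.

```python
# Build a "file" containing every possible byte value exactly once.
all_bytes = bytes(range(256))

# Decoding as Latin-1 never fails: one character per byte.
decoded = all_bytes.decode("latin-1")
same_numbers = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))
print(len(decoded), same_numbers)  # 256 True

# The same bytes are not valid UTF-8: a lone 0x80 is a stray continuation byte.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

So "valid Latin-1" tells you nothing about a file, while "valid UTF-8" is a real constraint.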

So yes: take any valid UTF-8 file and interpret it as Latin-1. It's valid in both UTF-8 and Latin-1. The first 128 characters are the same in both encodings, since both are based on ASCII; but if your UTF-8 file uses any non-ASCII characters, those will be interpreted as gibberish (yet valid) Latin-1.

| bytes | encoding | text |
|---|---|---|
| e6bc a2e5 ad97 | UTF-8 | 漢字 |
| e6bc a2e5 ad97 | Latin-1 | æ¼¢å (valid but nonsensical) |
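The same six bytes can be decoded both ways in Python. This is only a sketch using the built-in `utf-8` and `latin-1` codecs; the variable names are illustrative.

```python
# The six bytes from the table above.
raw = bytes.fromhex("e6bca2e5ad97")

as_utf8 = raw.decode("utf-8")      # two CJK characters, three bytes each
as_latin1 = raw.decode("latin-1")  # six characters, one byte each

print(len(as_utf8), len(as_latin1))  # 2 6

# Code points of each interpretation, printed as U+XXXX.
print([f"U+{ord(c):04X}" for c in as_utf8])
# ['U+6F22', 'U+5B57']
print([f"U+{ord(c):04X}" for c in as_latin1])
# ['U+00E6', 'U+00BC', 'U+00A2', 'U+00E5', 'U+00AD', 'U+0097']
```

The last two Latin-1 code points are U+00AD (soft hyphen) and U+0097 (a C1 control character), neither of which is normally visible, so only four glyphs show up on screen.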

edited Dec 08 '21 at 10:45

answered Dec 08 '21 at 10:37


Perhaps also mention the common case of [mojibake](https://en.wikipedia.org/wiki/Mojibake) where UTF-8 is mistakenly rendered using Latin-1, which obviously produces a sequence of Latin-1 glyphs. (There are gaps so that not all valid UTF-8 sequences correspond to sequences of printable glyphs in Latin-1.) For example, the Swedish word `för` gets turned into `fÃ¶r` – tripleee Dec 08 '21 at 11:06
Yes, in this example, there's a soft-hyphen and an unused character mapping, which is why there are only 4 Latin-1 characters… – deceze Dec 08 '21 at 11:13
The point is: nobody can assure that `æ¼¢å` **is** "nonsensical" - it could also be some kind of password. OP wants to recognize what cannot be recognized. – AmigoJack Dec 09 '21 at 01:34
This is exactly the kind of example I was looking for! – Martin Thoma Dec 09 '21 at 07:02

score 0 · Answer 2 · answered Dec 08 '21 at 10:28

Unicode is - somewhat simplified - a character set, and UTF-8 is one of multiple encodings for the binary representation of Unicode.

ISO-8859-1 is both a character set and an encoding.

At the character set level, ISO-8859-1 is a subset of Unicode, i.e. each ISO-8859-1 character also exists in Unicode, and the ISO-8859-1 code is even equal to the Unicode codepoint.

At the encoding level, ISO-8859-1 and UTF-8 use the same binary representation for the ISO-8859-1 characters up to 127. For the characters between 128 and 255 they differ: ISO-8859-1 still uses a single byte, while UTF-8 needs 2 bytes to represent each of them.

Example:

| Word | ISO-8859-1 | UTF-8 |
|---|---|---|
| Zürich | 5a fc 72 69 63 68 | 5a c3 bc 72 69 63 68 |
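The byte sequences above can be reproduced with a short Python sketch. It assumes Python 3.8 or later for the separator argument of `bytes.hex`; the variable names are illustrative.

```python
word = "Z\u00fcrich"  # "Zürich", with ü written as U+00FC

latin1_bytes = word.encode("latin-1")
utf8_bytes = word.encode("utf-8")

print(latin1_bytes.hex(" "))  # 5a fc 72 69 63 68
print(utf8_bytes.hex(" "))    # 5a c3 bc 72 69 63 68

# Reading the UTF-8 bytes as Latin-1 succeeds but yields 7 characters:
# the two bytes c3 bc become U+00C3 and U+00BC instead of one U+00FC.
misread = utf8_bytes.decode("latin-1")
print(len(word), len(misread))  # 6 7

# Reading the Latin-1 bytes as UTF-8 fails, because 0xfc is never valid in UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
print(latin1_is_valid_utf8)  # False
```

This shows the asymmetry: a UTF-8 file with non-ASCII characters is always readable as Latin-1 (with different content), whereas a Latin-1 file with non-ASCII characters is usually rejected as UTF-8.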
