I know that UTF-8 supports way more characters than Latin-1 (even with the extensions). But are there examples of files that are valid in both, but the characters are different? So essentially that the content changes, depending on how you think the file is encoded?

I also know that a big chunk of Latin-1 maps 1:1 to the same part in UTF-8. The question is: which code points could change the value if interpreted differently (not invalid, but different)?

AmigoJack
Martin Thoma
  • Does this answer your question? [Is ISO-8859-1 a Unicode charset?](https://stackoverflow.com/questions/12794825/is-iso-8859-1-a-unicode-charset) – Quentin Dec 08 '21 at 10:17
  • Re edit: The accepted answer to the duplicate question covers that. – Quentin Dec 08 '21 at 10:23
  • @Quentin I don't see how the accepted answer covers it. Are there now characters in Latin-1 / extensions that can get confused? – Martin Thoma Dec 08 '21 at 10:25
  • If you interpret any UTF-8 file that uses any non-ASCII characters as Latin-1, you'll get a whole lot of "weird" characters, yet it's "valid Latin-1"… Is that what you're asking?! – deceze Dec 08 '21 at 10:31

2 Answers


Latin-1 is a single-byte encoding (meaning 1 character = 1 byte) in which every possible byte value maps to something. So literally any file is "valid" in Latin-1: you can interpret any file as Latin-1 and you'll get… something… as a result.
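
As a rough illustration of that claim, here is a minimal Python sketch. It assumes Python's built-in `latin-1` codec, which maps all 256 byte values (including the control ranges); the variable names are only for illustration.

```python
# Every possible byte value, 0x00 through 0xFF, in one buffer.
all_bytes = bytes(range(256))

# Decoding as Latin-1 never raises: each byte becomes exactly one character.
text = all_bytes.decode("latin-1")
decoded_length = len(text)

# The same buffer is not valid UTF-8: 0x80 is a continuation byte
# that appears without a preceding lead byte.
try:
    all_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(decoded_length, valid_utf8)
```

So "valid Latin-1" tells you nothing about a file, whereas "valid UTF-8" is at least a weak hint.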

So yes: take any valid UTF-8 file and interpret it as Latin-1. It's valid in both UTF-8 and Latin-1. The first 128 characters are the same in both encodings, since both are based on ASCII; but if your UTF-8 file uses any non-ASCII characters, those will be interpreted as gibberish (yet valid) Latin-1.

| bytes | encoding | text | note |
|---|---|---|---|
| e6bc a2e5 ad97 | UTF-8 | 漢字 | |
| e6bc a2e5 ad97 | Latin-1 | æ¼¢å­ | valid but nonsensical |
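
The same six bytes can be checked in Python. This is only a sketch of the idea, with illustrative variable names; the last step (re-encoding as Latin-1 to get the original bytes back) is a commonly used way to undo this kind of mix-up and is an addition, not part of the example above.

```python
# The six bytes from the example above.
data = bytes.fromhex("e6bca2e5ad97")

as_utf8 = data.decode("utf-8")      # two characters: U+6F22, U+5B57
as_latin1 = data.decode("latin-1")  # six code points, one per byte

print(len(as_utf8), len(as_latin1))
print([hex(ord(ch)) for ch in as_latin1])

# Latin-1 decoding is lossless, so the misread text can be turned back
# into the original bytes and decoded properly.
recovered = as_latin1.encode("latin-1").decode("utf-8")
```

Two of the six Latin-1 code points (0xAD, a soft hyphen, and 0x97, a control character) normally have no visible glyph, which is why fewer than six characters show up on screen.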
deceze
  • Perhaps also mention the common case of [mojibake](https://en.wikipedia.org/wiki/Mojibake) where UTF-8 is mistakenly rendered using Latin-1, which obviously produces a sequence of Latin-1 glyphs. (There are gaps so that not all valid UTF-8 sequences correspond to sequences of printable glyphs in Latin-1.) For example, the Swedish word `för` gets turned into `fÃ¶r` – tripleee Dec 08 '21 at 11:06
  • Yes, in this example, there's a soft-hyphen and an unused character mapping, which is why there are only 4 Latin-1 characters… – deceze Dec 08 '21 at 11:13
  • The point is: nobody can assure that `æ¼¢å` **is** "nonsensical" - it could also be some kind of password. OP wants to recognize what cannot be recognized. – AmigoJack Dec 09 '21 at 01:34
  • This is exactly the kind of example I was looking for! – Martin Thoma Dec 09 '21 at 07:02

Unicode is - somewhat simplified - a character set, and UTF-8 is one of multiple encodings for the binary representation of Unicode.

ISO-8859-1 is both a character set and an encoding.

At the character set level, ISO-8859-1 is a subset of Unicode, i.e. each ISO-8859-1 character also exists in Unicode, and the ISO-8859-1 code is even equal to the Unicode codepoint.
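
A quick way to see this, sketched in Python under the assumption that its `latin-1` codec implements the full 256-entry ISO-8859-1 mapping (names are illustrative):

```python
# For every byte value, Latin-1 decoding yields the character whose
# Unicode code point equals that byte value.
for value in range(256):
    assert bytes([value]).decode("latin-1") == chr(value)

u_umlaut = "\u00fc"                          # LATIN SMALL LETTER U WITH DIAERESIS
latin1_byte = u_umlaut.encode("latin-1")[0]  # the single Latin-1 byte
code_point = ord(u_umlaut)                   # the Unicode code point

print(hex(latin1_byte), hex(code_point))
```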

At the encoding level, ISO-8859-1 and UTF-8 use the same binary representation for the characters up to 127. For the characters between 128 and 255 they differ, as UTF-8 needs 2 bytes to represent each of them.

Example:

| Word | ISO-8859-1 | UTF-8 |
|---|---|---|
| Zürich | 5a fc 72 69 63 68 | 5a c3 bc 72 69 63 68 |
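
The byte sequences in the example can be reproduced with a short Python sketch (it needs Python 3.8+ for `bytes.hex` with a separator; variable names are illustrative). It also shows what happens when each byte sequence is read with the other encoding.

```python
word = "Z\u00fcrich"  # "Zürich", written with an escape for U+00FC

latin1_hex = word.encode("latin-1").hex(" ")
utf8_hex = word.encode("utf-8").hex(" ")
print(latin1_hex)
print(utf8_hex)

# UTF-8 bytes read as Latin-1: always "valid", but the two bytes of the
# umlaut turn into two separate characters.
misread = word.encode("utf-8").decode("latin-1")

# Latin-1 bytes read as UTF-8: 0xFC is never a valid byte in UTF-8,
# so a strict decoder rejects the input.
try:
    word.encode("latin-1").decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

So the ambiguity mostly runs one way: non-ASCII UTF-8 always reads as (different) Latin-1 text, while Latin-1 bytes above 127 are usually rejected by a UTF-8 decoder.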
Codo