169

what is the difference between utf8 and latin1?

binbash
  • 5
    They are different encodings (with *some* characters mapped to the same byte sequences, namely the ASCII characters; accented letters such as é share code points but not bytes). UTF-8 is one encoding of Unicode with all its code points; Latin1 encodes fewer than 256 characters. – ShreevatsaR Apr 25 '10 at 16:45
  • There is also latin9 which is available in Linux locales and could have been mentioned in the question: https://en.wikipedia.org/wiki/ISO/IEC_8859-15 – baptx Apr 06 '20 at 17:19
  • Does this answer your question? [What is the difference between UTF-8 and ISO-8859-1?](https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1) – Karl Knechtel Aug 05 '22 at 02:57

2 Answers

185

UTF-8 is prepared for world domination, Latin1 isn't.

If you're trying to store non-Latin characters like Chinese, Japanese, Hebrew, Russian, etc. using Latin1 encoding, then they will end up as mojibake. You may find the introductory text of this article useful (and even more so if you know a bit of Java).
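As a minimal sketch of how mojibake arises, the Python snippet below writes text as UTF-8 and reads it back as Latin1, then tries to encode Chinese text as Latin1. The sample strings are arbitrary, and Python's `latin-1` codec is ISO-8859-1, which is assumed here to be close enough to MySQL's latin1 for illustration.

```python
# Illustrative sketch; sample strings are arbitrary choices.
text = "é"

# Bytes written as UTF-8 but interpreted as Latin1 turn into mojibake.
utf8_bytes = text.encode("utf-8")        # two bytes: 0xC3 0xA9
mojibake = utf8_bytes.decode("latin-1")  # each byte becomes its own character

# Latin1 has no byte for Chinese characters, so a strict encode fails.
try:
    "中文".encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# With a lossy error handler, each unencodable character becomes '?'.
lossy = "中文".encode("latin-1", errors="replace")

print(repr(mojibake), encodable, lossy)
```

Depending on configuration, a database may behave like either the strict or the lossy case; which one applies is not something this sketch can tell you.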

Note that full 4-byte UTF-8 support was only introduced in MySQL 5.5. Before that version, the utf8 character set only went up to 3 bytes per character, not 4. So it supported only the Basic Multilingual Plane (BMP) and not the supplementary planes, where e.g. most emoji live. If you want full 4-byte UTF-8 support, upgrade MySQL to at least 5.5 or go for another RDBMS like PostgreSQL. In MySQL 5.5+ it's called utf8mb4.
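To see which characters cross the 3-byte limit, here is a small Python sketch that measures the UTF-8 length of individual characters. The helper names `utf8_length` and `fits_in_three_bytes` are hypothetical, made up for this illustration; they are not part of MySQL or any library.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """Hypothetical check: True if every character needs at most 3 UTF-8 bytes,
    i.e. the text stays within the BMP."""
    return all(utf8_length(ch) <= 3 for ch in text)


samples = {
    "A": "A",                          # ASCII
    "e-acute": "\u00e9",               # Latin-1 Supplement
    "CJK": "\u4e2d",                   # BMP, 3 bytes
    "Linear B": "\U00010000",          # first code point beyond the BMP
    "emoji": "\U0001F600",             # supplementary plane
}

for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

print(fits_in_three_bytes("\u4e2d\u6587"))      # BMP-only text
print(fits_in_three_bytes("hi \U0001F600"))     # contains a 4-byte character
```

A check like this could be run on input before it reaches a 3-byte-only column, though how the database itself reacts to a 4-byte character is outside what the sketch shows.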

BalusC
  • 33
    MySQL 5.1 supports 3-byte UTF-8; however, MySQL 5.5 [does support](http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html) 4-byte UTF-8 as utf8mb4. – velcrow Aug 22 '11 at 18:02
  • True that, but MySQL 5.5 wasn't GA at the moment this answer was posted. It was released December 2010. – BalusC May 02 '12 at 18:34
  • 2
    @BalusC Can you elaborate more on how UTF-8 isn't fully supported? Does it mean that MySQL 5.1 can't store *all* Unicode characters? – Pacerier Jun 12 '12 at 05:54
  • 2
    @Pacerier: it only supports 3 bytes per character, thus only the BMP (the first 65,536 code points, U+0000 through U+FFFF) is supported; the rest is not. For all planes, see http://en.wikipedia.org/wiki/Plane_(Unicode) – BalusC Jun 12 '12 at 11:01
  • @BalusC So how do we store the unicode character `LINEAR B SYLLABLE B008 A`? : http://www.fileformat.info/info/unicode/char/10000/index.htm – Pacerier Jun 12 '12 at 18:29
  • @Pacerier: Upgrade to MySQL 5.5. – BalusC Jun 12 '12 at 18:45
  • 2
    @BalusC For people who are using 5.1.63 and don't have the privilege to update the web server's MySQL version, what are the alternatives? – Pacerier Jun 12 '12 at 18:54
  • 6
    @Pacerier: You could save as `VARBINARY` instead of `VARCHAR` and decode/encode in the business tier yourself, but this is hacky. Consider asking a new question, maybe there are better ways. – BalusC Jun 12 '12 at 18:57
  • Good answer! Sorry to nitpick. Chinese, Japanese, Hebrew are languages and contain characters. But Cyrillic is a language system (and contains languages). – HoldOffHunger Aug 23 '18 at 20:41
  • 1
    @HoldOffHunger: Right, answer has been adjusted. – BalusC Aug 24 '18 at 15:42
  • @Ali *"Before that version, it only goes up to 3 bytes, not 4 bytes per character."* And there's nothing specific to "MySQL 5.1". The change was in MySQL 5.5. – BalusC Feb 01 '19 at 11:53
  • You didn't answer the question, stackoverflow requires people to respond technically, not saying who is most used. Your answer is a typical offtopic. – e-info128 May 15 '20 at 05:33
63

In latin1 each character is exactly one byte long. In utf8 a character can consist of more than one byte. Consequently utf8 has more characters than latin1 (and the characters they do have in common aren't necessarily represented by the same byte or byte sequence; only the ASCII range, bytes 0 through 127, is encoded identically in both).
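The difference in byte representation can be sketched in Python by encoding the same characters both ways. The function name `encode_both` and the sample characters are illustrative assumptions, and Python's `latin-1` codec (ISO-8859-1) stands in for latin1 here.

```python
def encode_both(ch: str) -> tuple[bytes, bytes]:
    """Return (latin1_bytes, utf8_bytes) for a character both encodings contain."""
    return ch.encode("latin-1"), ch.encode("utf-8")


for ch in ("A", "\u00e9", "\u00ff"):  # 'A', e-acute, y-diaeresis
    latin1, utf8 = encode_both(ch)
    same = "same" if latin1 == utf8 else "different"
    print(f"U+{ord(ch):04X}: latin1={latin1.hex()} utf8={utf8.hex()} ({same})")
```

Latin1 always yields one byte per character, while UTF-8 switches to two bytes as soon as the code point is above 127.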

sepp2k
  • 1
    What about ascii and bin? – Yousha Aleayoub May 17 '17 at 10:54
  • 10
    @YoushaAleayoub ASCII is a single-byte encoding which uses only the byte values 0 through 127, so it can encode half as many characters as latin1. It's a strict subset of both latin1 and utf8, meaning the bytes 0 through 127 in both latin1 and utf8 encode the same things as they do in ASCII. Bin isn't an encoding. It's usually an option that you can give when reading a file, telling the IO functions not to apply any encoding, but instead just read the file byte by byte. – sepp2k May 17 '17 at 11:38
  • 1
    thanks, I meant `binary` collate...? and which one is better for english/numeric fields: `ascii_general_ci` or `ascii_bin`? – Yousha Aleayoub May 17 '17 at 12:29