1

I work with human-generated text which I download from different online datasets like GitHub Torrent, Twitter API, web-scraped HTML pages, Google BigQuery for GitHub etc. which means I have tens and hundreds of millions of text in the databse.

In which scenarios I should be setting a collation for UTF8 fields and UTF8 tables in MySQL databases? Is it necessary at all, cannot I simply use "CHARACTER SET UTF8"?

What are the differences between utf8 - default collation, utf8_unicode_ci, utf8_general_ci and utf8_general_mysql500_ci?

oski86
  • 855
  • 1
  • 13
  • 35

1 Answers1

3

Every textual column has a collation. It may be set explicitly in the table definition, or it may simply be set from the table's default, the database's default, or the server-wide default. But it has a collation.

The collations you mention are all case-insensitive. That is, they ignore the difference between upper- and lower- case letters. If you want case-sensitive collations use utf8_binary.

You probably want to use utf8_unicode_ci in a modern server. Read this for background. What's the difference between utf8_general_ci and utf8_unicode_ci

utf8_general_mysql500_ci is a collation specifically for backward compatibility to older versions of MySQL. http://dev.mysql.com/doc/relnotes/mysql/5.5/en/news-5-5-21.html

Community
  • 1
  • 1
O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • 1
    Another discussion: http://mysql.rjweb.org/doc.php/charcoll#utf8_collations . That section is followed by examples of where ORDER BY differs, then a section on the details of many utf8 collations -- what characters are ordered differently. – Rick James Feb 25 '15 at 22:45