I work with human-generated text that I download from various online sources, such as GitHub Torrent, the Twitter API, web-scraped HTML pages, and Google BigQuery for GitHub, which means I have tens to hundreds of millions of text entries in the database.
In which scenarios should I set a collation for UTF8 fields and UTF8 tables in MySQL databases? Is it necessary at all, or can I simply use "CHARACTER SET UTF8"?
What are the differences between utf8 with its default collation, utf8_unicode_ci, utf8_general_ci and utf8_general_mysql500_ci?
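
To make the question concrete, here is a minimal sketch of the options I am comparing. The table and column names (posts_default, posts_unicode, posts_mixed, body) are made up for illustration and are not my real schema.

```sql
-- Hypothetical tables, for illustration only.

-- Option 1: character set only, no COLLATE clause.
-- MySQL then applies whatever collation is the default for that character set.
CREATE TABLE posts_default (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body TEXT
) DEFAULT CHARACTER SET utf8;

-- Option 2: character set plus an explicit table-level collation.
CREATE TABLE posts_unicode (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body TEXT
) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- Option 3: collation set on a single column instead of the whole table.
CREATE TABLE posts_mixed (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body TEXT CHARACTER SET utf8 COLLATE utf8_general_ci
);

-- List the collations the server offers for utf8 and see which is marked as default.
-- Newer MySQL versions may report this character set as utf8mb3, hence both names.
SHOW COLLATION WHERE Charset IN ('utf8', 'utf8mb3');
```

Running `SHOW CREATE TABLE posts_default;` afterwards should show which collation the server actually applied when none was specified.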