
I use multiple languages in Wordpress (and other tools) and all my databases default to CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

But one of the plugins (“Bad Behavior”) has created a table with CHARSET=latin1 COLLATE=latin1_swedish_ci

and another has CHARSET=utf8 COLLATE=utf8_general_ci

I don’t know the difference between mb4 and general but I’ve seen the garbage generated when an app ignores charset specification and assumes ISOLatin1.  So I’m concerned with whether that table might cause problems.  The "visit plugin site" link goes to an Italian site for artists, events, and promoters (i.e., no connection to the plugin).

Seems to me like a question for Wordpress.SE, but they say anything about plugins is off-topic and refer me to here.

UPDATE: Turns out the table is empty, i.e., has no rows.  So I’ll leave it alone until (if ever) the plugin adds data to it.  I don't know what happens if one tries to add Chinese or Greek to a table set for Latin1.  Nor do I know whether "Bad Behavior" might try to do that, but I certainly know that spammers work in many languages.  Another blog I used to maintain got lots of Cyrillic spam.

WGroleau
  • Change the table columns to all have utf8mb4_unicode_ci and test your plugins to see if they cope with mb4 characters in whatever data those tables have? Do note that a table's charset/collation doesn't do anything besides serve as the default for new columns added. – ysth Jul 04 '23 at 23:07
  • Required reading: [UTF-8 Everywhere](https://utf8everywhere.org/) and [UTF-8 all the way through](https://stackoverflow.com/questions/279170/). – JosefZ Jul 05 '23 at 06:48
  • A 25-page manifesto promoting UTF-8 is not helpful to someone who has been trying to avoid other encodings for years. – WGroleau Jul 05 '23 at 15:00

1 Answer


Short Answer

All Italian characters are handled by either Character set. I don't know if there are any subtle Collation differences, but they may not matter.

Long Answer

We are stuck with encodings. utf8mb4 is the ultimate; there is essentially no need now, and 'never' will be, for another encoding. It handles text for any language in the world, plus Emoji. And it is extensible, meaning that if a new language springs up, it can be added without [yet again] breaking existing files, products, programs, etc.

MySQL picked latin1 a quarter of a century ago, before UTF-8 was more than 'wishful thinking'. Latin1 was good enough for Western Europe, but useless for the rest of the world.

It was painful to switch from latin1 to utf8mb4. And it was made even more painful by the misstep of utf8mb3 (called "utf8" at that time), which stores at most 3 bytes per character; utf8mb4 only arrived in 5.5. 8.0 bit the bullet and made utf8mb4 the default. Essentially, the only use for other encodings is when a non-UTF-8 document comes along. With one setting, and no user code, such a document can be automatically converted to utf8mb4 while it is being read into the table. And, if necessary, converted back while being read from the table.
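As a rough illustration of what that conversion amounts to, here is a minimal Python sketch. It is not MySQL code: the `transcode` helper and the sample text are made up for the example, and Python's `latin-1` codec only approximates MySQL's `latin1`. It also shows the garbage that results when UTF-8 bytes are wrongly assumed to be Latin1.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source).encode(target)


text = "Città è più"  # sample text; every character exists in Latin1

# Bytes as a Latin1 document would store them (one byte per character).
latin1_bytes = text.encode("latin-1")

# Conversion on the way in: Latin1 bytes become UTF-8 bytes.
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")

# Conversion on the way out: UTF-8 bytes become Latin1 bytes again.
round_trip = transcode(utf8_bytes, "utf-8", "latin-1")

# The failure mode: UTF-8 bytes read as if they were Latin1.
mojibake = utf8_bytes.decode("latin-1")

print(len(latin1_bytes), len(utf8_bytes))
print(round_trip == latin1_bytes)
print(repr(mojibake))
```

The round trip is lossless here only because every character in the sample exists in Latin1; the mis-declared read is where the familiar "Ã" debris comes from.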

WordPress users mostly use MySQL 5.5 or 5.6 -- the versions where UTF-8 handling was screwed up. Step 1 is to upgrade to 5.7 or 8.0. Beat on your cloud provider if they are in control of that.

A high percentage of users of WP / MySQL are confused by the terms "Character set" and "Collation".

The Character set (latin1 or utf8mb4) specifies the "encoding" of characters. (English characters are identically encoded, so this is less of a problem for many users.)
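To make "encoding" concrete, the following Python sketch compares the bytes produced for a few characters. The helper name `encode_or_none` is invented for this example, and Python's `latin-1` and `utf-8` codecs stand in for the MySQL character sets, so treat it as an approximation rather than a description of the server.

```python
from typing import Optional


def encode_or_none(text: str, charset: str) -> Optional[bytes]:
    """Return the encoded bytes, or None if the charset cannot represent the text."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


for ch in ["A", "é", "Ω", "中"]:
    print(repr(ch), encode_or_none(ch, "latin-1"), encode_or_none(ch, "utf-8"))

# A lossy fallback: unrepresentable characters are replaced by '?'.
lossy = "Ω".encode("latin-1", errors="replace")
print(lossy)
```

English letters come out byte-for-byte identical, accented Western European letters differ between the two encodings, and Greek or Chinese cannot be represented in Latin1 at all. Whether a database rejects such a value or silently stores a placeholder depends on its configuration; the `errors="replace"` line only illustrates the lossy outcome.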

The Collation determines how text is sorted or selected. The simplest example is "case insensitive", where 'A' and 'a' are treated as equal. This is indicated by a COLLATION ending in _ci. Most applications are happy with that.
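A small Python sketch of the idea behind `_ci`, using `str.casefold()` as a stand-in for the collation. Real MySQL collations use their own weight tables, so this is an analogy, and the helper `ci_equal` and the sample words are made up for the example.

```python
def ci_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case."""
    return a.casefold() == b.casefold()


words = ["banana", "apple", "Cherry"]

# Code-point ("binary") order: uppercase letters sort before lowercase ones.
binary_order = sorted(words)

# Case-insensitive order: compare on the case-folded form instead.
ci_order = sorted(words, key=str.casefold)

print(ci_equal("A", "a"))
print(binary_order)
print(ci_order)
```

The same three strings sort differently depending on the rule applied, which is all a collation is: a rule for comparing and ordering text that is already stored.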

The _general_, _unicode_, _0900_, etc., indicate variations of what to do with accents, "phone book" ordering, etc. Most applications need not be concerned, and the default is probably OK.
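As a hedged sketch of what "what to do with accents" can mean, the Python below builds a comparison key that ignores both accents and case. The helpers `strip_accents` and `loose_key` are hypothetical names, and the approach (Unicode NFD decomposition, then dropping combining marks) only approximates what an accent-insensitive collation does; it is not how MySQL implements any particular collation.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after decomposing characters (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def loose_key(text: str) -> str:
    """Comparison key that ignores both accents and case."""
    return strip_accents(text).casefold()


words = ["zeta", "étoile", "eta"]

# Code-point order puts 'é' after 'z'.
binary_order = sorted(words)

# Accent- and case-insensitive order treats 'é' like 'e'.
loose_order = sorted(words, key=loose_key)

print(loose_key("Città") == loose_key("CITTA"))
print(binary_order)
print(loose_order)
```

Different collation families make different choices about details like these, which is why the same data can sort or match slightly differently under each.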

If your WP world is just in the US or Western Europe, none of this matters, except for the disruptions. But note that Emoji won't work (at least not correctly) without utf8mb4.
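The Emoji limitation comes down to byte length: utf8mb3 stores at most 3 bytes per character, and Emoji need 4 in UTF-8. A short Python check, where `utf8_len` and `fits_utf8mb3` are illustrative helper names rather than anything provided by MySQL:

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 UTF-8 bytes (code point <= U+FFFF)."""
    return all(ord(c) <= 0xFFFF for c in text)


for ch in ["a", "é", "中", "😀"]:
    print(repr(ch), utf8_len(ch), fits_utf8mb3(ch))
```

Western European, Greek, Cyrillic, and most Chinese characters fit within 3 bytes; Emoji and other characters beyond U+FFFF are the ones that need utf8mb4.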

PS. Once you get past the utf8mb4 hurdles, this plugin can make WP run faster: WP Index Improvements

Rick James
  • I'm sure this information is useful to someone who didn't learn it years ago. But it is not an answer to the question of "whether that table might cause problems." Italian is only one of the many languages I work with, but emoji is not. – WGroleau Jul 05 '23 at 16:59
  • "I don't know what happens if one tries to add Chinese or Greek [or Cyrillic] to a table set for Latin1." -- S**t happens; don't do it. – Rick James Jul 05 '23 at 21:22
  • "I don’t know the difference between mb4 and general" - Those are not parallel terms. Hence, I launched into a primer. – Rick James Jul 05 '23 at 21:24
  • Not true. Latin1 doesn't have the Euro symbol (it is in Latin15), nor the ‰ symbol. It lacks some useful typographic characters, and also the `long s` (ſ), of which one of the dictionaries I used was a big fan. Note: Latin15 lacks the acute accent (without a base character), which is present in Latin1, and in Italian it is important to distinguish it from the apostrophe. -- Latin1 is for the past. When it was developed, there were already methods to switch charsets (and BTW by the same organization: ECMA): the Latin1 extension was just one set – Giacomo Catenazzi Jul 06 '23 at 09:48
  • I think the computing industry is moving toward UTF-8, leaving all other encodings as "legacy" and encouraging users to convert to UTF-8 (utf8mb4 in MySQL) for storing in databases. – Rick James Jul 06 '23 at 18:14