
I have a MySQL DB that needs to be fast at scale.

Option 1: Tables store the language's ISO 639-3 code directly as a column: varchar(3) language.

Option 2: Tables store an ID for the language as a column: int(2?) language_id, with a separate languages table mapping the ID to the ISO 639-3 code.

Question: What makes sense for speed at scale? Option 1 is easier to read in the DB, so I'd prefer it if the speed difference is zero or negligible, even at scale.

Thanks!


1 Answer


I recommend:

CREATE TABLE ...
    ISO_639_3 CHAR(3) CHARACTER SET ascii

That will be 3 bytes, which is smaller than INT (4 bytes) and not much bigger than SMALLINT UNSIGNED (2 bytes).

(Am I correct in saying that the codes are always 3 ASCII letters? Hence no need for VARCHAR, which takes an extra length byte or two.)

CHAR(3) is readily indexable. There is no significant advantage in 'normalizing', even to SMALLINT. This still applies even at the scale of a billion rows.
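
For concreteness, here is a minimal sketch of Option 1; the articles table and its other columns are hypothetical, just to show the shape:

    -- Hypothetical table: the ISO 639-3 code is stored inline (Option 1)
    CREATE TABLE articles (
        id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        ISO_639_3 CHAR(3) CHARACTER SET ascii NOT NULL,  -- e.g. 'eng', 'deu', 'jpn'
        title     VARCHAR(255) NOT NULL,
        PRIMARY KEY (id),
        INDEX lang (ISO_639_3)     -- 3-byte index entries
    ) ENGINE=InnoDB;

    -- No JOIN to a languages table is needed to filter or display the language:
    SELECT id, title FROM articles WHERE ISO_639_3 = 'eng';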

And, as you point out, "easier to read" is worth something.

If you are also storing text, I assume that all such text can be mapped to UTF-8? If so, use

     my_text TEXT CHARACTER SET utf8mb4

In MySQL, there is no problem having different columns in a single table using different charsets (or collations).
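
To illustrate (the documents table and column names are my own invention), one table can mix an ascii column with a utf8mb4 column:

    -- Sketch: two charsets in one table
    CREATE TABLE documents (
        id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        ISO_639_3 CHAR(3) CHARACTER SET ascii NOT NULL,  -- 3 bytes, letters only
        my_text   TEXT CHARACTER SET utf8mb4,            -- full Unicode
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;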

Perhaps worth noting... Many languages can be recognized from the hex of the UTF-8 encoding:

⚈  Cxyy -- More Western Europe: Latin (C3-CA), Combining Diacritical Marks (CC-CD), Greek (CE-CF)
⚈  Dxyy -- Cyrillic (D0-D4), Hebrew (D6-D7), Arabic/Persian/Farsi (D8-DB), etc
⚈  E0yyyy -- various Indian character sets, southern Asia, etc.
⚈  E1yyyy -- Cherokee, Balinese, Khmer, Mongolian, Vietnamese, etc.
(etc)

-- http://mysql.rjweb.org/doc.php/charcoll#diagnosing_charset_issues
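
As a rough diagnostic (a sketch, reusing the hypothetical documents table above), you can peek at the hex of the stored text's leading bytes:

    -- Leading bytes D0/D1 suggest Cyrillic, CE/CF Greek, etc. (see list above)
    SELECT id, LEFT(HEX(my_text), 8) AS first_bytes_hex
    FROM documents
    WHERE HEX(my_text) REGEXP '^(D0|D1)';   -- likely Cyrillic text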
