Nicholas you seem to have some fundamental confusions with Ascii Vs UTF-8 Character sets in your Question and your comments to answer(s).
UTF-8 Value: @.
ASCII Length (from UTF-8 string):14
ASCII Length:5
ASCII Value:?@?.
I would expect the length to be 14 characters in the ASCII string once it's converted?
No, If the Panda Face UTF-8 character was represented in Ascii how would it be represented? At best this would be subjective such as with a <3
or a B-)
etc.
There is no translation of the Pandaface, hence it is substituted with the placeholder ?
in the output character set. It is somewhat like trying to spell king but only with vowels. There are simply less ascii options than UTF8
.
So please take away that Ascii is a practical sub-set of UTF-8
, not vice versa.
MySQL Unique Storage Solution
MYSQL Unique indexes have a limit of 767 bytes in total. You can chain these indexes together and for any table MySQL can provide a total unique index of 3072 bytes. For the purposes of using a single index column of collation UTF8mb4_unicode_ci
(ie, the one you should be using) then the unique index column would be:
<max index size in bytes> / <max bytes per character in collation>
767 / 4 = 191 characters.
Therefore MySQL will only unqiuely index the first 191 characters of any UTF-8 string.
To sidestep this limiter, you would make a new table, with two columns, an Auto_increment
integer column, and a varchar column:
CREATE TABLE `emails` (
`id` int(8) NOT NULL AUTO_INCREMENT,
`email` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
Then each time a new email address is added, you search this table if it already exists (the column is indexed but is not unique
) and if not, then the email address is inserted and referenced by the id
column.
The email
column is always UTF8mb4 because this is full UTF8 unlike the MySQL standard utf8_
collation. MySQL can't uniquely qualify data larger than 767 bytes as you have said, but if your various other tables reference the id row for the email, that column on the other tables can be unique.
Some Further thoughts
1 htmlentities
is not an effective solution because for any character the size of it's entity is always bigger, take the >
character, which is >
this is already 4 characters in length at best case, even if each of these could be stored in "1 byte", this would still be a larger storage than with >
which as a general UTF-8 character at worst case would be 4-bytes.
htmlentities
will only effect characters that have a specified HTML alternative, and I'm unsure if things like <PandaFace>
or <shitpoo>
have htmlentities(?).
2 What is the longest email address you have ever seen or even ever used, that is a real genuine address? While the maximum size of email addresses is 254 ascii characters, that is:
thisisaverylongandtediousemailaddresswhichisprettyimpractical.
andonlyreallyworth.jacksquitintheamount.ofspacethiscantakeupinyourdatabase
@home.somewhere.overtherainbow.ornear.somepot.of.irishgold.thinkaboutthis.
thisemailisthemaximumlengthallowed.co.uk.com
Now look at that code, that is the longest allowed ascii email address by definition. That' pretty long and while not impossible, the number of users who have email addresses (in ascii) of this length will be an extreme edgecase.
Taking this a step futher down this line, say that you have an email address that is 64 UTF-8 4-byte characters as you've set as the upper utf-8 limit,
So as ascii something of this length:
horsesandgoastandcatsanddogsandfleas@some.petting.zoo.org.uk.com
But as UTF-8 4byte characters and say this above email was translated into certain UTF-8 Chinese character sets, this email address length is still the upper range of what is practical for humans to actually use and have as their addresses. But it is not quite out the park, it's unlikely unless you're aiming for a specific market audience.
The MySQL Unique Indexing of 767 bytes would limit you to approximately 191 4-byte UTF-8 characters, then you'd be limited to 47 fully UTF-8 characters in an email address featuring 2 (well, max 3) non UTF-8 4byte character (such as @
and .
).
Example:
thisIsAnEmailOfUTF8CharasandA@IntheRightPlace.com
Now remember that this email doesn't look that long, it's of a more realistic size than others, but each and every character (except .
and @
) would need to be at the 4byte UTF-8 encoding for this to hit the MySQL unique index limit, so for example if each of the characters in the email was of a certain non-latin language such as Ethiopian or certain UTF-8 Chinese sets.
3
It is also worth noting that Chinese (and I think Japanese) characters are each words or syllabales in their own right (therefore bigger than simply letters), so (I hazard) few Chinese would have excessive email addresses instead you'd have:
猫@空间农场.com
This is donkey@spacefarm.com
*, taking up 10 character spaces in Chinese, whereas the ascii latin takes up 20 character spaces.
Further to this there are some (sub)sets of Chinese and Japanese characters that are still not present in the UTF-8 standard. (annoyingly, the example above is one of these).
*^ Google translation, so may be wrong!
Some Conclusion Options
Store your Email in plaintext UTF-8 in a specific table with a unique AI column (as outlined above). reference/cross-reference the column AI id number to discover if the email text is unique on any other field/column in the database. Do not Unique the email column, simply index it, but unique the index reference to that column.
Store the Email address as a hash and check if the hash is unique such as with sha1
in PHP . SHA1
is better than MD5 because it is a longer hash so can accept more values without collisions (although collisions are still possible). Sha hashes are always 160 bit or 40 characters long and therefore comfortably fit into the MySQL unique column constraints.
Store your email address to a VARCHAR(190)
length and expect that to cover 98%+ of your database usership.
MySQL unique index limit is not as likely to effect your emails as the criteria for valid email length.
You may be able to get away with using email addresses that are technically questionably valid but weather these are accepted by routers and DNS servers is pretty much up to each server.
Email is an old and anachronistic way of transporting data. Consider the future will be more like SnapChat [example] and other database based authenticated communications which have few of the curtailing limits that email inherits. Email is also very tedious to code with and prone to a wide variety of issues errors and problems as well as extremely poor security overheads.
MySQL Storing The Email Address
Option 1 ) Hash the email address and store the hash in a unique column.
Postives:
This will mean you can store the email in the same column as you'd originally intended. Email should be fixed length sha
hash. MySQL Unique column contstraint would be valid.
Negatives
Hash collisions would be possible, email address itself would not be searchable or "de-codable".
Option 2 ) Store the email address plaintext in the UTF-8 column and simply limit the email VARCHAR
field size to 190 characters.
Positives:
This would probably cover all likely valid email addresses.
Negatives:
Longer email addresses would be invalid and truncated, meaning they would be saved without error but would not be the same text strings (due to truncation).
Option 3 ) Store the Email in a new MySQL table with an indexed VARCHAR
column and an auto_increment
numerical reference column as detailed above.
This would then mean that any occurance of the email text would be replaced by a numerical reference for that row in the database. The column that features the original email text can then be a unique index.
Positives:
This means you can store emails as unique entities and can carry out SQL checks for if they already appear.
Negatives:
This would mean changing your current coding and SQL commands slightly to accommodate this new table as a reference table.
Example
Email Reference Table:
CREATE TABLE `email_reference` (
`id` int(8) NOT NULL AUTO_INCREMENT,
`email` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
Users (example) table:
CREATE TABLE `userdata` (
`user_id` int(8) NOT NULL AUTO_INCREMENT,
`name` varchar(90) COLLATE utf8mb4_unicode_ci NOT NULL,
`email_ref` int(11) DEFAULT NULL,
`details` text COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`user_id`),
UNIQUE KEY `email_ref` (`email_ref`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
The above userdata
table will have a unique column for email ref
which will reference the email table. This unique column means no two userdata
rows can reference the same row in the email_reference
table.
Because it is a UNIQUE
column it is a good idea to allow NULL values for if anyone for any reason doesn't have an email or other such "uniqueness escape" situations.
The long and the short of my long post is that I think your concerns appear to be mostly edge cases or due to imperfect Database structural design, rather than due to issues with character sets or unique keys themselves. If what you're envisaging with your system are not edge cases then using the MySQL AI int
reference system I have outlined above should, with a little bit of foresight on your part, cover your needs.