1

I'm trying to support UTF-8 characters into email addresses. If I understand correctly, email addresses are limited to 254 usable (ASCII) characters. Based on this, I would like to store email address in a VARCHAR(254) ASCII MySQL InnoDB column. One of the problems I'm encountering is to validate such scenarios. I'm trying to convert UTF-8 to ASCII but getting mixed results as shown below (I know the example is not a valid email but I could have used other characters - this is just to explain the problem):

<?php
$string = '@.';
echo 'UTF-8 Value: ' . $string . '<br/>';
echo 'ASCII Length (from UTF-8 string):' . mb_strlen($string, 'ASCII') . '<br/>';
$stringAscii =  mb_convert_encoding($string, 'ASCII', 'UTF-8');
echo 'ASCII Length:' . strlen($stringAscii) . '<br/>';
echo 'ASCII Value:' . $stringAscii . '<br/>';

The output is:

UTF-8 Value: @.

ASCII Length (from UTF-8 string)::14

ASCII Length:5

ASCII Value:?@?.

I would expect the length to be 14 characters in the ASCII string once it's converted? How can I convert the UTF-8 string to ASCII without losing its original length and value? Basically I'm looking for a way to store a UTF-8 string into its ASCII format while being able to convert it back to its original UTF-8 format.

I also tried other type of encoding output (e.g. byte outputs) but was unable to find any output matching the 14 characters length. I also tried iconv which is returning exceptions for there characters. The idea to convert in ASCII is that I can support this value as a primary key of a table in MySQL within my VARCHAR(254). I could always try to convert to HTML-ENTITIES but it will be hard to predict the maximum size of the string to reflect it in the DB schema.

An other option is to use a UTF-8MB4 encoded VARCHAR(256) column in MySQL but when used as a primary key, this will go above the 767 bytes index limit and require to enable large index in InnoDB which I would prefer to avoid.

Is there a way to achieve what I'm trying to do without using innodb_large_prefix=on in MySQL?

Community
  • 1
  • 1
Nicolas Bouvrette
  • 4,295
  • 1
  • 39
  • 53
  • from the same page [Email_address#Internationalization_examples](https://en.wikipedia.org/wiki/Email_address#Internationalization_examples) – Mi-Creativity Aug 06 '16 at 17:19
  • @Mi-Creativity Not sure I understand how this helps? The example I gave has a valid local part but the domain is not valid. The question remains the same. – Nicolas Bouvrette Aug 06 '16 at 17:25
  • Not all email addresses are in ASCII format, if you look at that link you'll see Greek, Japanese and Chinese emails – Mi-Creativity Aug 06 '16 at 17:30
  • The domain converts separately and is covered by IDNA. – Ignacio Vazquez-Abrams Aug 06 '16 at 17:30
  • @IgnacioVazquez-Abrams Yes we need to use puny code and the maximum domain name is also 253 characters but this is not the problem I have here. – Nicolas Bouvrette Aug 06 '16 at 17:37
  • 1
    Yes, the problem is that you're trying to put non-ASCII directly into an ASCII store. That won't work. – Ignacio Vazquez-Abrams Aug 06 '16 at 17:39
  • @IgnacioVazquez-Abrams Actually trying to convert UTF-8 to ASCII and then store it in an ASCII store using PHP. The problem seems to be at the conversion level which is not working as I would expect. – Nicolas Bouvrette Aug 06 '16 at 17:41
  • 2
    That's because it *doesn't* convert unless your UTF-8 only contains ASCII characters in the first place. – Ignacio Vazquez-Abrams Aug 06 '16 at 17:45
  • @IgnacioVazquez-Abrams Ok I'm not sure I understand exactly why it doesn't convert but that's probably because I'm not savy in encoding. Do you think that using HTML-ENTITIES would be a good approach? If I'm not mistaken the maximum length of a UTF-8 character is 6 (in ASCII) and 10 in HTML-ENTITIES? With this info I would be able to provision MySQL accordingly. – Nicolas Bouvrette Aug 06 '16 at 17:49
  • Actually most answers about maximum length are wrong. Maximum length is not _254 characters_ but _254 octets_ (i.e. bytes), see e.g. http://stackoverflow.com/a/7717694/6600109 -- So a field like `VARBINARY(254)` would be optimal from storage point of view. – Markus Laire Aug 07 '16 at 19:27
  • "ASCII Length (from UTF-8 string)::14 " -- _misnomer_. That it _byte_ length, not _ASCII_ length. The 5 _characters_ `@.` takes 4+1+4+1+4 _bytes_; ASCII is not involved. – Rick James Aug 11 '16 at 21:55

3 Answers3

5

Nicholas you seem to have some fundamental confusions with Ascii Vs UTF-8 Character sets in your Question and your comments to answer(s).

UTF-8 Value: @.

ASCII Length (from UTF-8 string):14

ASCII Length:5

ASCII Value:?@?.

I would expect the length to be 14 characters in the ASCII string once it's converted?

No, If the Panda Face UTF-8 character was represented in Ascii how would it be represented? At best this would be subjective such as with a <3 or a B-) etc.

There is no translation of the Pandaface, hence it is substituted with the placeholder ? in the output character set. It is somewhat like trying to spell king but only with vowels. There are simply less ascii options than UTF8.

So please take away that Ascii is a practical sub-set of UTF-8, not vice versa.

MySQL Unique Storage Solution

MYSQL Unique indexes have a limit of 767 bytes in total. You can chain these indexes together and for any table MySQL can provide a total unique index of 3072 bytes. For the purposes of using a single index column of collation UTF8mb4_unicode_ci (ie, the one you should be using) then the unique index column would be:

<max index size in bytes> / <max bytes per character in collation> 
          767             /            4                    = 191 characters. 

Therefore MySQL will only unqiuely index the first 191 characters of any UTF-8 string.

To sidestep this limiter, you would make a new table, with two columns, an Auto_increment integer column, and a varchar column:

CREATE TABLE `emails` (
 `id` int(8) NOT NULL AUTO_INCREMENT,
 `email` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
 PRIMARY KEY (`id`),
 KEY `email` (`email`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

Then each time a new email address is added, you search this table if it already exists (the column is indexed but is not unique) and if not, then the email address is inserted and referenced by the id column.

The email column is always UTF8mb4 because this is full UTF8 unlike the MySQL standard utf8_ collation. MySQL can't uniquely qualify data larger than 767 bytes as you have said, but if your various other tables reference the id row for the email, that column on the other tables can be unique.

Some Further thoughts

1 htmlentities is not an effective solution because for any character the size of it's entity is always bigger, take the > character, which is &gt; this is already 4 characters in length at best case, even if each of these could be stored in "1 byte", this would still be a larger storage than with > which as a general UTF-8 character at worst case would be 4-bytes.

htmlentities will only effect characters that have a specified HTML alternative, and I'm unsure if things like <PandaFace> or <shitpoo> have htmlentities(?).

2 What is the longest email address you have ever seen or even ever used, that is a real genuine address? While the maximum size of email addresses is 254 ascii characters, that is:

thisisaverylongandtediousemailaddresswhichisprettyimpractical.
andonlyreallyworth.jacksquitintheamount.ofspacethiscantakeupinyourdatabase
@home.somewhere.overtherainbow.ornear.somepot.of.irishgold.thinkaboutthis.
thisemailisthemaximumlengthallowed.co.uk.com

Now look at that code, that is the longest allowed ascii email address by definition. That' pretty long and while not impossible, the number of users who have email addresses (in ascii) of this length will be an extreme edgecase.

Taking this a step futher down this line, say that you have an email address that is 64 UTF-8 4-byte characters as you've set as the upper utf-8 limit,

So as ascii something of this length:

  horsesandgoastandcatsanddogsandfleas@some.petting.zoo.org.uk.com

But as UTF-8 4byte characters and say this above email was translated into certain UTF-8 Chinese character sets, this email address length is still the upper range of what is practical for humans to actually use and have as their addresses. But it is not quite out the park, it's unlikely unless you're aiming for a specific market audience.

The MySQL Unique Indexing of 767 bytes would limit you to approximately 191 4-byte UTF-8 characters, then you'd be limited to 47 fully UTF-8 characters in an email address featuring 2 (well, max 3) non UTF-8 4byte character (such as @ and .).

Example:

thisIsAnEmailOfUTF8CharasandA@IntheRightPlace.com

Now remember that this email doesn't look that long, it's of a more realistic size than others, but each and every character (except . and @) would need to be at the 4byte UTF-8 encoding for this to hit the MySQL unique index limit, so for example if each of the characters in the email was of a certain non-latin language such as Ethiopian or certain UTF-8 Chinese sets.

3 It is also worth noting that Chinese (and I think Japanese) characters are each words or syllabales in their own right (therefore bigger than simply letters), so (I hazard) few Chinese would have excessive email addresses instead you'd have:

猫@空间农场.com 

This is donkey@spacefarm.com*, taking up 10 character spaces in Chinese, whereas the ascii latin takes up 20 character spaces.

Further to this there are some (sub)sets of Chinese and Japanese characters that are still not present in the UTF-8 standard. (annoyingly, the example above is one of these).

*^ Google translation, so may be wrong!

Some Conclusion Options

  • Store your Email in plaintext UTF-8 in a specific table with a unique AI column (as outlined above). reference/cross-reference the column AI id number to discover if the email text is unique on any other field/column in the database. Do not Unique the email column, simply index it, but unique the index reference to that column.

  • Store the Email address as a hash and check if the hash is unique such as with sha1 in PHP . SHA1 is better than MD5 because it is a longer hash so can accept more values without collisions (although collisions are still possible). Sha hashes are always 160 bit or 40 characters long and therefore comfortably fit into the MySQL unique column constraints.

  • Store your email address to a VARCHAR(190) length and expect that to cover 98%+ of your database usership.

  • MySQL unique index limit is not as likely to effect your emails as the criteria for valid email length.

  • You may be able to get away with using email addresses that are technically questionably valid but weather these are accepted by routers and DNS servers is pretty much up to each server.

  • Email is an old and anachronistic way of transporting data. Consider the future will be more like SnapChat [example] and other database based authenticated communications which have few of the curtailing limits that email inherits. Email is also very tedious to code with and prone to a wide variety of issues errors and problems as well as extremely poor security overheads.


MySQL Storing The Email Address

Option 1 ) Hash the email address and store the hash in a unique column.

  • Postives: This will mean you can store the email in the same column as you'd originally intended. Email should be fixed length sha hash. MySQL Unique column contstraint would be valid.

  • Negatives Hash collisions would be possible, email address itself would not be searchable or "de-codable".

Option 2 ) Store the email address plaintext in the UTF-8 column and simply limit the email VARCHAR field size to 190 characters.

  • Positives: This would probably cover all likely valid email addresses.

  • Negatives: Longer email addresses would be invalid and truncated, meaning they would be saved without error but would not be the same text strings (due to truncation).

Option 3 ) Store the Email in a new MySQL table with an indexed VARCHAR column and an auto_increment numerical reference column as detailed above.

This would then mean that any occurance of the email text would be replaced by a numerical reference for that row in the database. The column that features the original email text can then be a unique index.

  • Positives: This means you can store emails as unique entities and can carry out SQL checks for if they already appear.

  • Negatives: This would mean changing your current coding and SQL commands slightly to accommodate this new table as a reference table.

Example

Email Reference Table:

CREATE TABLE `email_reference` (
 `id` int(8) NOT NULL AUTO_INCREMENT,
 `email` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
 PRIMARY KEY (`id`),
 KEY `email` (`email`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

Users (example) table:

CREATE TABLE `userdata` (
 `user_id` int(8) NOT NULL AUTO_INCREMENT,
 `name` varchar(90) COLLATE utf8mb4_unicode_ci NOT NULL,
 `email_ref` int(11) DEFAULT NULL,
 `details` text COLLATE utf8mb4_unicode_ci NOT NULL,
 PRIMARY KEY (`user_id`),
 UNIQUE KEY `email_ref` (`email_ref`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

The above userdata table will have a unique column for email ref which will reference the email table. This unique column means no two userdata rows can reference the same row in the email_reference table.

Because it is a UNIQUE column it is a good idea to allow NULL values for if anyone for any reason doesn't have an email or other such "uniqueness escape" situations.


The long and the short of my long post is that I think your concerns appear to be mostly edge cases or due to imperfect Database structural design, rather than due to issues with character sets or unique keys themselves. If what you're envisaging with your system are not edge cases then using the MySQL AI int reference system I have outlined above should, with a little bit of foresight on your part, cover your needs.

Martin
  • 22,212
  • 11
  • 70
  • 132
  • `猫@空间农场` does not need utf8mb4, but some other Chinese characters do. – Rick James Aug 06 '16 at 22:20
  • @Martin very impressive answer. You are right about my confusion and it's now very clear (glad I asked this question!). Also the reason I wanted to use the email address as unique index / primary key is that I didn't see any reason for the auto increment index when all queries of this table will query the email address. I might also be wrong but unique index have the same 767 bytes limitation. If you can add the precision about the 767 bytes limitation in your answer I will accept it since it would be my preferred approach! – Nicolas Bouvrette Aug 06 '16 at 22:24
  • @Martin An other comment, the HTML-ENTITIES as I suggested only converts UTF-8 characters so my calculation was still valid. I added a mention on my answer. Feel free to add this as well since, unless I missed something, this would also be a valid solution. – Nicolas Bouvrette Aug 06 '16 at 22:35
  • @NicolasBouvrette I have updated my answer with a more indepth example of how to set up an email reference table. – Martin Aug 07 '16 at 11:21
  • @Martin There is a lot of creativity in your answer but I'm not sure I understand how to use a hash considering you often need to send email back to users and you cannot recover the original value of an hash. I also think you are missing the most obvious (simple) options to solve this problenlm as mentionned in my answer. – Nicolas Bouvrette Aug 07 '16 at 12:45
  • The hash column is for checking it is unique index, the email plaintext can be stored in another column which does not have a unique index constraint. @NicolasBouvrette – Martin Aug 07 '16 at 13:33
  • I also did not mention the `InnoDb` MySQL alterations because that is referenced better in your answer (and I got the impression originally that this was a solution you wanted to avoid). – Martin Aug 07 '16 at 13:35
  • @Martin you're right but my assumption was wrong from the begining :) I just find that the hash/id solution add even mote complexity than my 632 characters ASCII varchar. In my opinion the MySQL solution is the best one for this problem. The rest is mostly interesting for a dinner conversation :) – Nicolas Bouvrette Aug 07 '16 at 13:53
  • It was an option with stating, but my prefered choice would be to have the separate email table. And I try to avoid talking work during dinner, it would bore my friends to death, haha! `:-D` – Martin Aug 07 '16 at 14:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/120364/discussion-between-nicolas-bouvrette-and-martin). – Nicolas Bouvrette Aug 07 '16 at 18:10
2

I'm adding the missing details in my own answer (special thanks for Ignacio, andig, Martin and Markus Laire for helping me to put the pieces of this puzzle together).

There are two problems to this question :

  1. Encoding conversion from UTF-8 to ASCII
  2. MySQL index limit to 767 bytes without enabling innodb_large_prefix for MySQL < 5.7.7 (looks like this is now enabled by default).

Answer for "Encoding conversion from UTF-8 to ASCII"

ASCII is a subset of UTF-8 so not all characters can be converted. ASCII only uses 128 characters (the first 128) per byte while UTF-8 bytes can use more. The ideal solution would be to use an encoding that support all 256 possibilities per 8-bits bytes. Some encoding like cp1252 supports most characters but even if this is true, some characters are invisible which could end up causing issues.

For a true byte by byte conversion the only reliable option is to use binary. for our user case given we use MySQL, the best option would be to have a VARBINARY(254) (binary fields don't have encoding). After that it would be easy to simply:

INSERT into user_table set email_address='@.';
SELECT * FROM user_table where email_address = '@.';

To be safe, values can also be HEX('') on the client of application side if needed. This is truly the most efficient solution for this problem given that you will only store email address in a 254 bytes column which is by RFC standard the maximum length with any encoding.

Answer for "MySQL index limit to 767 bytes"

It looks like InnoDB large prefixes is now the default configuration on MySQL >= 5.7.7 since it was mostly a backward compatible setting. While one could implement this complex UTF-8 to HTML-ENTITIES conversion, it probably make more sense to just upgrade MySQL when using a UTF-8 email address as a primary key. Or one could also simply enable large prefixes in the MySQL configuration for MySQL <= 5.7.7:

innodb_large_prefix=on
innodb_file_format=barracuda

Conslusion

Keep in mind that while some providers supports UTF-8 in email addresses, it is still not mainstream in 2016. In the meantime there are a few options to store the information but less to make sure it will make it to its destination.

Nicolas Bouvrette
  • 4,295
  • 1
  • 39
  • 53
  • I don't know the email standard, but `` takes 4-bytes, but they are not "Ascii". Encoding with htmlentities (or unicode escaping or ...) will take more than 4 _ascii_ characters. – Rick James Aug 06 '16 at 22:23
  • 5.7, by default, avoids the 191 kludge mentioned in another answer. The 767 _byte_ limit was raised. – Rick James Aug 06 '16 at 22:24
  • @RickJames I should have started by reading on this. I did fix the 767 in my MySQL server years ago but I thought it was a non-standard approach... if I would have searched this aspect first I might have avoided the question but wouldn't have learned as much on encoding today... – Nicolas Bouvrette Aug 06 '16 at 22:26
  • @RickJames Also, the take 4 bytes in length, ASCII takes 1 byte per character. This is how we can estimate the maximum UTF-8 length in ASCII even while it's not possible to display in ASCII. – Nicolas Bouvrette Aug 06 '16 at 22:27
  • 4 bytes is a *maximum* UTF-8 size, but UTF-8 characters can be 1,2, or 3 bytes as well. – Martin Aug 07 '16 at 13:34
  • @Martin by the way do you know why when I tried to convert UTF-8 to an other 8 bits (1 byte per char) encoding (e.g. ISO-8859- 1) I didn't seem to get equivalent length in my strings? The original idea was to convert bit by bit so that I could convert into a 254 bytes column in MySQL but I couldnt find a way to do this. In theory it should work unless I missed something? – Nicolas Bouvrette Aug 07 '16 at 14:16
  • Have you tried reading [UTF-8 All the way through](http://stackoverflow.com/questions/279170/how-to-support-utf-8-completely-in-a-web-application)? Answers are too big for comments here but I have found in the past a few very good, very long answers about Character sets across Stack Overflow.... – Martin Aug 07 '16 at 14:31
  • @Martin Yes UTF-8 works well. The problem is on the MySQL level. If I provision a UTF-8 witha length of 254, it will provision 254 x 4 (1016 bytes) when actually the email limit is truly 254 bytes. It doesn't seem an issue on recent versions of MySQL but I wonder how it can impact performance and storage on the long run. – Nicolas Bouvrette Aug 07 '16 at 14:35
  • Performance and storage are issues that I don't know enough about as specific issues, it all depends on your dataset size, shard sizing and general indexing and read/write activities. Lots of small tables (as per my example email table, etc.) are better than one large table. Perhaps have a nosey around [Database Administrators Stack Exchange](http://dba.stackexchange.com/) for deeper knowledge of these things :) – Martin Aug 07 '16 at 14:38
  • @Martin great suggestion, I will. – Nicolas Bouvrette Aug 07 '16 at 14:41
0

You cannot "convert" a UTF8 string to ASCII at the same length if the characters do not have an ASCII representation as in your example.

What you could do is to create some kind of representation of the bytecodes that make up the UTF8 characters. I doubt that would be useful as email address though.

UPDATE

In UTF8 each character can consume multiple bytes. how many varies by character. If ASCII one character is one byte. So you can use each byte of the UTF8 character and see chat character that byte represents in ASCII. However- this will have absolutely nothing to to with the original UTF8 character except for those UTF8 characters that are represented by a single byte. IMHO those will match their ASCII representation.

andig
  • 13,378
  • 13
  • 61
  • 98
  • I'm not sure I understand, how can you calculate the strlen of 14 if you can never convert it to that string?? Bytecode could be useful if I can convert it back and predict its maximum length - I'm actually thinking HTML-ENTITIES might make more sense. If I'm not mistaken the maximum ASCII length in this output per UTF-8 character is 10 while strlen will calculate 6 (40% increase maximum). – Nicolas Bouvrette Aug 06 '16 at 17:35
  • Still not sure I understand your UPDATE - I'm not an encoding expert but even if the ASCII output is not human readable, I would expect to be able to get some sort of 14 "characters" ASCII version and store it to a database and convert it back? – Nicolas Bouvrette Aug 06 '16 at 17:42
  • ASCII bytes cover 0x00 through 0x7f. UTF-8 can have bytes all the way to 0xff. Trying to put a value of 0x80 through 0xff in an ASCII byte won't work, because then it's not ASCII. – Ignacio Vazquez-Abrams Aug 06 '16 at 17:55
  • @IgnacioVazquez-Abrams That make sense but does the value (14) returned by `strlen` in PHP make sense then? – Nicolas Bouvrette Aug 06 '16 at 18:00
  • Yes, because `strlen()` performs a naive string length checking, counting each byte as one character. Since "" uses 4 bytes in UTF-8, the total is 14 bytes. – Ignacio Vazquez-Abrams Aug 06 '16 at 18:38