
I'm developing an application where people around the world enter an address, a city, or something else into a search box, then select the result that matches their target. The selected result contains text from the address.components long_name field.

Some examples returned by the Geocoding API:

"long_name" : "King's Street",
"short_name" : "King's St",
"types" : [ "route" ]

"long_name" : "Newport",
"short_name" : "Newport",
"types" : [ "postal_town" ]

"long_name" : "Staffordshire",
"short_name" : "Staffordshire",
"types" : [ "administrative_area_level_2", "political" ]

In this case I would store, for example:

"King's Street"

"Newport"

"Staffordshire"

into my database.

Then... this application can store locations from all countries, potentially in all the official native languages used in those countries, as returned by Google in the "long_name" strings. Note that I set both the country and the language in the geocoder, so that the map is shown in the user's native language and the results (the address.components strings) come back in the correct language for the user.

Does anyone know whether the address.components long_name values can be stored exactly (character-set wise) using MySQL's utf8 charset (which is limited to 3-byte UTF-8 sequences), or whether I need to use the utf8mb4 charset (full 4-byte UTF-8)?

If I need to use the utf8mb4 charset, what is the reason for that? Which languages returned by the Google geocoder require utf8mb4 (4-byte UTF-8) in order not to lose any character/language information when storing in the database?

bongobongo
  • Anything that says it's UTF-8 is, as a standard, 4-byte UTF-8. MySQL is the exception in that it initially went for 3 bytes. It is highly recommended that you use `utf8mb4` over the 3-byte version wherever possible. [This Stack Overflow post](http://stackoverflow.com/questions/279170/utf-8-all-the-way-through) should be very helpful to you too. – Martin Jun 30 '16 at 19:01
  • I checked the post you linked to. That person wants to support Unicode all the way. My approach is somewhat different: I only need to support what I have to support. If the Google geocoder does not return any characters (in the address.components long_name strings) that require 4-byte UTF-8, then I see no reason to use the utf8mb4 charset, since the only result would be (a) more data in the database and (b) larger indexes, which in turn mean slower queries and more resources used on the server. Is there any documentation showing which character sets the geocoder uses? – bongobongo Jun 30 '16 at 21:30
  • If I were making this choice, I would always choose `utf8mb4`, because using any other `utf8_` MySQL charset is simply waiting for the same problem to come back and bite you another day. I have no idea what the geocoder uses, but UTF-8 is a de facto web standard character set now. And the size of the dataset (unless it's really massive) will not affect indexes or data-retrieval speeds. MySQL is good up to and beyond billions of rows of data. – Martin Jun 30 '16 at 21:36
  • Also, if any of the data returned by the geocoder is locale-specific, such as place names in far-flung parts of the world, those characters will be lost or mangled in 3-byte utf8 storage. Seriously, save yourself the headache down the line and stop the problem before it becomes a problem. :-) Use `utf8mb4`. – Martin Jun 30 '16 at 21:40

2 Answers


If your application is a greenfield job (a new application) and you're on a recent version of MySQL or MariaDB, you should use utf8mb4. It will handle everything in Unicode, including characters outside the Basic Multilingual Plane, and you won't have to give this question another thought.
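As a concrete illustration of why that matters (a hypothetical check in Python, not actual geocoder output): some CJK characters used in real place and business names sit outside the 3-byte range that MySQL's old utf8 charset can store:

```python
# The longest UTF-8 encoding of any character in a string decides
# whether MySQL's 3-byte "utf8" charset can hold it losslessly.
samples = [
    "Newport",   # ASCII: 1 byte per character
    "Zürich",    # Latin with diacritics: up to 2 bytes
    "東京",       # CJK in the Basic Multilingual Plane: 3 bytes
    "𠮷野家",     # contains U+20BB7, which needs 4 bytes -> utf8mb4 only
]

for text in samples:
    max_bytes = max(len(ch.encode("utf-8")) for ch in text)
    print(f"{text!r}: widest character is {max_bytes} byte(s); "
          f"fits 3-byte utf8: {max_bytes <= 3}")
```

Only the last sample would be mangled in a utf8 column; the first three are safe either way, which is exactly why the problem is easy to miss in testing.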

O. Jones
  • Appreciate the quick response. But to narrow my question further: do any of the address.components long_name strings returned from the geocoder make use of characters that require more than 3-byte UTF-8? If so, which languages in particular (returned by the Google geocoder) force me to use 4-byte UTF-8? The reason I'm asking is quite simple: a lot of data is going into the database, and I want to use as little DB space as possible. If I can get away with 3-byte UTF-8, then my indexes will also be smaller, and queries will probably be faster too. – bongobongo Jun 30 '16 at 20:41
  • Only Google Maps people know the answer to your precise question. If you're going to scrape enough information to make a difference, you have probably already paid for commercial access. So, ask your Google account manager to find out for you. If you haven't paid for access, they will throttle your scraping, and disk drive prices will come down before you see even the slightest impact from this. – O. Jones Jun 30 '16 at 22:13

The comments imply that the real question is about the size difference between 3-byte utf8 and 4-byte utf8mb4. (I assume you are using VARCHAR or TEXT?)

  • For English, there is no difference -- each character takes 1 byte in either utf8 or utf8mb4. Neither the size, nor the encoding differs.
  • For Europe, there is no difference -- each character takes 1 or 2 bytes.
  • For most of the languages of Asia, there is no difference -- each character takes exactly 3 bytes.
  • For Chinese, there is a problem -- some Chinese characters need 4 bytes; storing such data in a utf8 column will lead to truncation or other mangling.

So, you may as well use utf8mb4 for everything.
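If you ever need to audit existing strings for characters that would not survive a 3-byte utf8 column, the test reduces to "does the string contain a code point above U+FFFF?". A small sketch (the sample strings are illustrative, not real geocoder output):

```python
import re

# Characters at U+10000 and above need 4 bytes in UTF-8; these are
# exactly the ones MySQL's 3-byte utf8 charset cannot store.
NEEDS_UTF8MB4 = re.compile(r"[\U00010000-\U0010FFFF]")

def fits_3byte_utf8(text: str) -> bool:
    """True if every character fits in MySQL's old 3-byte utf8."""
    return NEEDS_UTF8MB4.search(text) is None

print(fits_3byte_utf8("King's Street"))  # True
print(fits_3byte_utf8("東京"))            # True  (BMP, 3 bytes each)
print(fits_3byte_utf8("𠮷野家"))          # False (U+20BB7 needs 4 bytes)
```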

To every application but MySQL, "UTF-8" refers to the full variable-length encoding. The original definition even allowed sequences of up to 6 bytes, although RFC 3629 now limits UTF-8 to 4 bytes, which covers every code point up to U+10FFFF; no characters have been assigned beyond that.
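A quick way to see that 4-byte ceiling in practice (a Python sketch; the specific characters are just examples):

```python
# U+10FFFF is the highest code point Unicode defines; even it
# needs only 4 bytes in UTF-8, so utf8mb4 covers all of Unicode.
print(len(chr(0x10FFFF).encode("utf-8")))  # 4
print(len("🍣".encode("utf-8")))            # 4 (an emoji, outside the BMP)
print(len("中".encode("utf-8")))            # 3 (fits in MySQL's old utf8)
```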

Rick James