4

I'm seeking for a MySQL function to convert every special latin character (not to be confused with latin1 charset - I'm using utf8) to its non-latin equivalent from the standard A-Z letters range.

Here's what I literally want to achieve:

'Zażółć gęślą jaźń' --> 'Zazolc gesla jazn'

I've tried many different stunts with CONVERT and CAST functions. No success so far.

I've also googled very deeply, unfortunately every single problem with MySQL related to character conversion refers to the typical latin1 default collation which is not what I'm looking for.


My far goal here is to convert product names to uri friendly strings on the fly, via TRIGGER ... AFTER INSERT, e.g.

Crème Brûlée --> creme_brulee

Those converted names will serve not only as SEO URIs but also as references to thumbnail's file names.

Besides LOWER(@str) and REPLACE(@str, ' ', '_'), one of the steps is to flatten those latin characters.

I'd prefer to keep those converted names in the database, although maybe you know some more proper way in the architectural point of view?

matewka
  • 9,912
  • 2
  • 32
  • 43
  • If they had already added normalization functions to MySQL, they could most likely be used for this, since with them you can split the characters into base form and diactirics. After that filtering the diacritics away would be very simple. Unfortunately it's not implemented yet. – Sami Kuhmonen Apr 19 '15 at 10:19
  • @SamiKuhmonen thanks, good to know at least this. – matewka Apr 19 '15 at 10:30

1 Answers1

3

Well, Sami Kuhmonen's comment on my question practically depleted all possibilities.

But I had to go with any working solution, so I decided to copy the PHP Doctrine's Doctrine_Inflector:unaccent method from GitHub. MySQL implementation definetely doesn't look handsome but I guess nesting REPLACE() functions is the only way to perform a multi-string-replacement in MySQL.

Here's the whole TRIGGER:

 CREATE DEFINER = CURRENT_USER TRIGGER `product_BEFORE_INSERT` BEFORE INSERT ON `product` FOR EACH ROW

 # new value
 SET NEW.name_uri = LOWER(

 REPLACE( 

 # ugly set of nested replaces
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
 REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(

 # the original value
 NEW.name,

 # every accented latin character represented by its ASCII code
 # is replaced with normalized letter
 CHAR(195,128), 'A'),  CHAR(195,129), 'A'), CHAR(195,130), 'A'),  CHAR(195,131), 'A'), CHAR(195,132), 'A'),  CHAR(195,133), 'A'),
 CHAR(195,135), 'C'),  CHAR(195,136), 'E'), CHAR(195,137), 'E'),  CHAR(195,138), 'E'), CHAR(195,139), 'E'),  CHAR(195,140), 'I'),
 CHAR(195,141), 'I'),  CHAR(195,142), 'I'), CHAR(195,143), 'I'),  CHAR(195,145), 'N'), CHAR(195,146), 'O'),  CHAR(195,147), 'O'),
 CHAR(195,148), 'O'),  CHAR(195,149), 'O'), CHAR(195,150), 'O'),  CHAR(195,153), 'U'), CHAR(195,154), 'U'),  CHAR(195,155), 'U'),
 CHAR(195,156), 'U'),  CHAR(195,157), 'Y'), CHAR(195,159), 's'),  CHAR(195,160), 'a'), CHAR(195,161), 'a'),  CHAR(195,162), 'a'),
 CHAR(195,163), 'a'),  CHAR(195,164), 'a'), CHAR(195,165), 'a'),  CHAR(195,167), 'c'), CHAR(195,168), 'e'),  CHAR(195,169), 'e'),
 CHAR(195,170), 'e'),  CHAR(195,171), 'e'), CHAR(195,172), 'i'),  CHAR(195,173), 'i'), CHAR(195,174), 'i'),  CHAR(195,175), 'i'),
 CHAR(195,177), 'n'),  CHAR(195,178), 'o'), CHAR(195,179), 'o'),  CHAR(195,180), 'o'), CHAR(195,181), 'o'),  CHAR(195,182), 'o'),
 CHAR(195,182), 'o'),  CHAR(195,185), 'u'), CHAR(195,186), 'u'),  CHAR(195,187), 'u'), CHAR(195,188), 'u'),  CHAR(195,189), 'y'),
 CHAR(195,191), 'y'),
 # Decompositions for Latin Extended-A
 CHAR(196,128), 'A'),  CHAR(196,129), 'a'), CHAR(196,130), 'A'),  CHAR(196,131), 'a'), CHAR(196,132), 'A'),  CHAR(196,133), 'a'),
 CHAR(196,134), 'C'),  CHAR(196,135), 'c'), CHAR(196,136), 'C'),  CHAR(196,137), 'c'), CHAR(196,138), 'C'),  CHAR(196,139), 'c'),
 CHAR(196,140), 'C'),  CHAR(196,141), 'c'), CHAR(196,142), 'D'),  CHAR(196,143), 'd'), CHAR(196,144), 'D'),  CHAR(196,145), 'd'),
 CHAR(196,146), 'E'),  CHAR(196,147), 'e'), CHAR(196,148), 'E'),  CHAR(196,149), 'e'), CHAR(196,150), 'E'),  CHAR(196,151), 'e'),
 CHAR(196,152), 'E'),  CHAR(196,153), 'e'), CHAR(196,154), 'E'),  CHAR(196,155), 'e'), CHAR(196,156), 'G'),  CHAR(196,157), 'g'),
 CHAR(196,158), 'G'),  CHAR(196,159), 'g'), CHAR(196,160), 'G'),  CHAR(196,161), 'g'), CHAR(196,162), 'G'),  CHAR(196,163), 'g'),
 CHAR(196,164), 'H'),  CHAR(196,165), 'h'), CHAR(196,166), 'H'),  CHAR(196,167), 'h'), CHAR(196,168), 'I'),  CHAR(196,169), 'i'),
 CHAR(196,170), 'I'),  CHAR(196,171), 'i'), CHAR(196,172), 'I'),  CHAR(196,173), 'i'), CHAR(196,174), 'I'),  CHAR(196,175), 'i'),
 CHAR(196,176), 'I'),  CHAR(196,177), 'i'), CHAR(196,178), 'IJ'), CHAR(196,179), 'ij'), CHAR(196,180), 'J'),  CHAR(196,181), 'j'),
 CHAR(196,182), 'K'),  CHAR(196,183), 'k'), CHAR(196,184), 'k'),  CHAR(196,185), 'L'), CHAR(196,186), 'l'),  CHAR(196,187), 'L'),
 CHAR(196,188), 'l'),  CHAR(196,189), 'L'), CHAR(196,190), 'l'),  CHAR(196,191), 'L'), CHAR(197,128), 'l'),  CHAR(197,129), 'L'),
 CHAR(197,130), 'l'),  CHAR(197,131), 'N'), CHAR(197,132), 'n'),  CHAR(197,133), 'N'), CHAR(197,134), 'n'),  CHAR(197,135), 'N'),
 CHAR(197,136), 'n'),  CHAR(197,137), 'N'), CHAR(197,138), 'n'),  CHAR(197,139), 'N'), CHAR(197,140), 'O'),  CHAR(197,141), 'o'),
 CHAR(197,142), 'O'),  CHAR(197,143), 'o'), CHAR(197,144), 'O'),  CHAR(197,145), 'o'), CHAR(197,146), 'OE'), CHAR(197,147), 'oe'),
 CHAR(197,148), 'R'),  CHAR(197,149), 'r'), CHAR(197,150), 'R'),  CHAR(197,151), 'r'), CHAR(197,152), 'R'),  CHAR(197,153), 'r'),
 CHAR(197,154), 'S'),  CHAR(197,155), 's'), CHAR(197,156), 'S'),  CHAR(197,157), 's'), CHAR(197,158), 'S'),  CHAR(197,159), 's'),
 CHAR(197,160), 'S'),  CHAR(197,161), 's'), CHAR(197,162), 'T'),  CHAR(197,163), 't'), CHAR(197,164), 'T'),  CHAR(197,165), 't'),
 CHAR(197,166), 'T'),  CHAR(197,167), 't'), CHAR(197,168), 'U'),  CHAR(197,169), 'u'), CHAR(197,170), 'U'),  CHAR(197,171), 'u'),
 CHAR(197,172), 'U'),  CHAR(197,173), 'u'), CHAR(197,174), 'U'),  CHAR(197,175), 'u'), CHAR(197,176), 'U'),  CHAR(197,177), 'u'),
 CHAR(197,178), 'U'),  CHAR(197,179), 'u'), CHAR(197,180), 'W'),  CHAR(197,181), 'w'), CHAR(197,182), 'Y'),  CHAR(197,183), 'y'),
 CHAR(197,184), 'Y'),  CHAR(197,185), 'Z'), CHAR(197,186), 'z'),  CHAR(197,187), 'Z'), CHAR(197,188), 'z'),  CHAR(197,189), 'Z'),
 CHAR(197,190), 'z'),  CHAR(197,191), 's'),

 # Euro Sign
 CHAR(226,130,172), 'E'),
 # GBP (Pound) Sign
 CHAR(194,163), ''),
 'Ä', 'Ae'), 'ä', 'ae'), 'Ü', 'Ue'), 'ü', 'ue'),
 'Ö', 'Oe'), 'ö', 'oe'), 'ß', 'ss'),
 # Norwegian characters
 'Å', 'Aa'), 'Æ', 'Ae'), 'Ø', 'O'), 'æ', 'a'), 'ø', 'o'), 'å', 'aa')

 , ' ', '_'));
matewka
  • 9,912
  • 2
  • 32
  • 43
  • It is so sick `REPLACE`-everything approach that i'm voting it up :D Why don't you use iconv on the code-side? I'm not sure if your `TRIGGER` will cover **all** special chars which are defined in UTF-8. – Paweł Tomkiel Apr 19 '15 at 12:21
  • @Paul, thanks for your concerns. I don't want to use any kind of code-side solutions because all database operations are performed directly via SQL. I don't incorporate any kind of CMS (yet) and most of the products will be inserted during deployment on production. When it comes to all character coverage, the website is in Polish language and it is a confectionery so I only need Polish letters and French ones for some typical cake names. – matewka Apr 19 '15 at 13:45