2

I have some product names that include unicode characters

⚠️PLEASE READ! WORKING KODAK DC215 ZOOM 1.0MP DIGITAL CAMERA - UK SELLER

A query in heidiSQL shows it fine

enter image description here

I setup MariaDB new this morning having moved from MySQL, but when records are retrieved through a ColdFusion Query using the MariaDB JDBC I get

java.lang.StringIndexOutOfBoundsException: begin 0, end 80, length 74
    at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3410)
    at java.base/java.lang.String.substring(String.java:1883)
    at org.mariadb.jdbc.internal.com.read.resultset.rowprotocol.TextRowProtocol.getInternalString(TextRowProtocol.java:238)
    at org.mariadb.jdbc.internal.com.read.resultset.SelectResultSet.getString(SelectResultSet.java:948)

The productname field collation is utf8mb4_unicode_520_ci, I've tried a few options. I've tried to set this at table and database level where it let me.

The JDBC connection string in ColdFusion admin is jdbc:mysql://localhost:3307/usedlens?useUnicode=true&characterEncoding=UTF-8

I note that the live production database where MariaDB was used from the beginning I don't have this trouble but the default charset is latin1, and the same record is the database as

????PLEASE READ! WORKING KODAK DC215 ZOOM 1.0MP DIGITAL CAMERA - UK SELLER

Daniel Cook
  • 1,033
  • 1
  • 9
  • 19
  • See "question mark" in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored – Rick James Aug 04 '20 at 01:20
  • Were there exactly 4 question marks? I would expect 10 for those 2 Emoji: hex: `E29AA0 EFB88F F09F93B7` – Rick James Aug 04 '20 at 01:22
  • @RickJames I copied the text as it was, so yes 4 ????, it was the same product imported on prod as dev. incidentally if i didn't make it clear before my database is handling them fine, see added screenshot, it's just the jdbc used by coldfusion which is throwing an error. – Daniel Cook Aug 04 '20 at 06:46
  • I have never used MariaDB, but shouldn't you specify `characterEncoding=utf8mb4`? – Mark Rotteveel Aug 04 '20 at 07:06
  • I tried that no difference, and i just tried no to no avail `?useEncoding=true&characterEncoding=UTF-8&sessionVariables=character_set_client=utf8mb4,character_set_connection=utf8mb4,character_set_results=utf8mb4,collation_connection=utf8mb4_general_ci` – Daniel Cook Aug 04 '20 at 07:17
  • So, ⚠️ is part of the system error message? And is part of your text? – Rick James Aug 04 '20 at 15:07
  • @MarkRotteveel - `useUnicode=yes&characterEncoding=UTF-8` -- These are for the Java side. utf8mb4 is for the MariaDB side. – Rick James Aug 04 '20 at 15:10
  • @RickJames both of those icons are unicode emoticons included in the product name. That is the text copied straight from the database (see db screenshot). Those are what breaks a ColdFusion query via the JDBC. However see I've posted below, I have decided to just strip them out on import now so I don't need to handle them. – Daniel Cook Aug 04 '20 at 15:43
  • 1
    FYI: I use CF2016 with the Microsoft JDBC SQL driver and a NVARCHAR field and emojis are correctly stored (using default Latin collation). We also support multiple foreign languages in this same column and haven't had any query issues. I wonder if this is a strictly JDBC issue. – James Moberg Aug 05 '20 at 17:26

3 Answers3

1

This is due to a sequence of high ASCII characters that form emojis. I encountered similar issues when exporting MSSQL data to a UTF-8 file to be converted to Excel using a 3rd party tool. In this case, the database and file were correct, but the 3rd party tool would crash when encountering emoji characters.

Our approach to this was to convert emojis to their aliases so that information wasn't lost in the process. (If you strip high ASCII characters, you may lose some context.) To sanitize emojis to use aliases, I wrote this ColdFusion cf-emoji-java (CFC) to leverage emoji-java (JAR file) to convert emojis to their ASCII7-safe aliases.

emojijava = new emojijava();
emojijava.parseToAliases('I like ');   // I like :pizza:
James Moberg
  • 4,360
  • 1
  • 22
  • 21
  • Removing the emojis might be my only option, and thanks for this. But as you see in added screenshot the database does handle them fine, it's only the JDBC driver in ColdFusion that's throwing a error, I would have thought it'd been solved by correct configuration somewhere. – Daniel Cook Aug 04 '20 at 06:51
  • Emojis don't make sense for a product field. If the text you are expecting should be ASCII7, you may be able to strip them entirely. If these are user comments, some context or expression may be lost depending on how the emoticon is used. – James Moberg Aug 05 '20 at 17:29
1

Here's how we've been stripping high ASCII characters while retaining any characters that may be salvaged:

string function ASCIINormalize(string inputString=""){
    return createObject( 'java', 'java.text.Normalizer' ).normalize( javacast("string", arguments.inputString) , createObject( 'java', 'java.text.Normalizer$Form' ).valueOf('NFD') ).replaceAll('\p{InCombiningDiacriticalMarks}+','').replaceAll('[^\p{ASCII}]+','');
}

productname = ASCIINormalize(productname);

/*
Comparisons using java UDF versus reReplace regex:

"ABC   Café ’test"  (note: High ASCII non-normal whitespace characters used.)
   ASCIINormalize = "ABC Cafe test"
   reReplace = "ABC Caf test"

"čeština"
   ASCIINormalize = "cestina"
   reReplace = "etina"

"Häuser Bäume Höfe Gärten"
   ASCIINormalize = "Hauser Baume Hofe Garten"
   reReplace = "Huser Bume Hfe Grten"
*/
James Moberg
  • 4,360
  • 1
  • 22
  • 21
  • NOTE: The above solution is from https://stackoverflow.com/a/11642205/693068 where I also recommend using a 3rd-party java library called JUnidecode. – James Moberg Aug 05 '20 at 17:48
0

Since...

  • I'm not really in the business of supporting emojis
  • My data is just product names targeted at UK, Europe and the United States for the foreseeable future
  • I don't want to have to go through the same trouble with production (already defaulted to latin1_swedish_ci)

I decided to..

Match production, so I set the database, table, and fields to latin1_swedish_ci with help from

How to change the CHARACTER SET (and COLLATION) throughout a database?

and strip non ASCII characters in the product name

== edit don't do this, it takes out too many useful characters ==

<cfset productname = reReplace(productname, "[^\x20-\x7E]", "", "ALL")>
Daniel Cook
  • 1,033
  • 1
  • 9
  • 19