
As part of a project, we needed to move from Ubuntu 14.04 to Ubuntu 16.04. Since the move was completed, however, the software has not been working correctly: characters are being jumbled when they are stored in the database. The same Debian package version of the software produces different results on different systems, which suggests either an ISO-level issue (a different library) or a difference in Java behaviour.

A server that was upgraded in place has no problems; the issue appears only on fresh installs. That again points to the ISO level, but there is no obvious sign of which library (or similar) may have failed to install.

Logging was added to print the bytes received, and Java reads them as expected. However, once they are stored in the database, the bytes are completely different. The data is stored via a JPA connection that was set up earlier, and the connection already uses the 'useUnicode=true&characterEncoding=UTF-8' parameters. When Java reads this data back, it still treats the bytes as correct when they are not. Likewise, if you add something directly to the DB, Java's debugging logs do not show the correct bytes, yet the information is still shown correctly when displayed via the interface, which could only have passed through here. This implies the issue is with storing the data rather than handling it, but the same version of the Debian package is installed on both systems. The working system reads the bytes correctly when it gets them out of the database.

For example, the Arabic text شلاؤ, viewed with the HEX() function in MySQL/MariaDB, is stored as "D8B4D984D8A7D8A4" on the working system but as "C398C2B4C399C284C398C2A7C398C2A4" on the broken one. This may provide more information as to why the encoding is failing. Since Java reads the incorrect bytes as if they were correct, the problem is more likely to be on the Java side, but the confusion remains because of the inconsistency between systems.

  • It seems the correct text (probably already in UTF-8) was UTF-8 encoded again as if it were in some single-byte encoding (multi-byte sequences for single chars make the text longer). If a database dump/script was used to copy the database, then therein lies the problem. – Joop Eggen Jun 06 '19 at 09:38
  • The database has not been copied. This database issue occurs when new data is received (via the server's gateway). But agreed, double encoding of some description is a likely cause, but working out a solution appears to be the concern... – Alex Watson Jun 06 '19 at 10:19
  • Sorry that was the easiest error cause. There are many defaults for encodings in mariadb/mysql, DDL & config, on different levels: the database, the table, the column, and even the transmission (your useUnicode). – Joop Eggen Jun 06 '19 at 10:28

2 Answers


D8B4D984D8A7D8A4 is the correct utf8 (or utf8mb4) encoding for شلاؤ. C398C2B4C399C284C398C2A7C398C2A4 is the "double-encoded" version. This implies that something is still specifying "latin1" as the character set. Perhaps you dumped and reloaded the data, and that is where it happened?

For more on this, see "Trouble with UTF-8 characters; what I see is not what I stored" and perhaps http://mysql.rjweb.org/doc.php/charcoll
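The double encoding described above can be reproduced outside the database. The following is a minimal sketch, not the asker's code: the class and method names are made up for illustration, and it assumes the misreading step uses ISO-8859-1 on the Java side. It takes the UTF-8 bytes of شلاؤ, misreads them as ISO-8859-1, and encodes the result as UTF-8 again, which yields exactly the hex string from the broken system.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
public class DoubleEncodingDemo {

    // Render bytes as upper-case hex, similar to the output of HEX() in MySQL/MariaDB.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulate the bug: UTF-8 bytes are misread as ISO-8859-1,
    // and the resulting string is then encoded as UTF-8 a second time.
    static String doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return hex(misread.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Unicode escapes for the four Arabic letters, so the demo does not
        // depend on the encoding of the source file itself.
        String text = "\u0634\u0644\u0627\u0624";

        // prints D8B4D984D8A7D8A4
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));
        // prints C398C2B4C399C284C398C2A7C398C2A4
        System.out.println(doubleEncode(text));
    }
}
```

Each original byte such as D8 becomes the two-byte sequence C3 98, which is why the broken value is exactly twice as long as the correct one.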

Rick James

For anyone who may be experiencing something similar, the cause turned out to be that Java was running without defaulting to UTF-8. OpenEJB/JPA was configured correctly, as was the database, but one part of the server was defaulting to a different charset, so setting the charset in the startup arguments for the affected component resolved the problem!
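A quick way to check for this kind of problem is to print the JVM's default charset on each system and compare. The sketch below is an illustration only, with hypothetical class and method names. The exact startup argument that was used is not stated here; `-Dfile.encoding=UTF-8` is the usual candidate and is an assumption on my part.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic class; names are illustrative only.
public class DefaultCharsetCheck {

    // Decodes with the JVM default charset, which may differ between systems.
    static String decodeWithDefault(byte[] raw) {
        return new String(raw);
    }

    // Decodes with an explicit charset, independent of JVM or OS settings.
    static String decodeExplicit(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // On older JVMs this is derived from the OS locale unless overridden
        // at startup (assumed here: -Dfile.encoding=UTF-8).
        System.out.println("Default charset: " + Charset.defaultCharset());

        // UTF-8 bytes for the Arabic letter U+0634.
        byte[] raw = {(byte) 0xD8, (byte) 0xB4};

        String viaDefault = decodeWithDefault(raw);
        String viaExplicit = decodeExplicit(raw);

        // If these differ, some code path relying on the default charset
        // will corrupt non-ASCII text.
        System.out.println("Default decode matches UTF-8 decode: "
                + viaDefault.equals(viaExplicit));
    }
}
```

Running this on both the working and the broken system should show whether the default charset differs. Passing an explicit charset to `getBytes` and `new String`, as in `decodeExplicit`, avoids depending on the default at all.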