
Here are the hex values of two strings stored in a MySQL database using two different methods:

20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5

and

E0A495E0A4BEE0A49AE0A48220E0A4B6E0A495E0A58DE0A4A8E0A58BE0A4AEE0A58DE0A4AFE0A4A4E0A58DE0A4A4E0A581E0A4AEE0A58D20E0A5A420E0A4A8E0A58BE0A4AAE0A4B9E0A4BFE0A4A8E0A4B8E0A58DE0A4A4E0A4BF20E0A4AEE0A4BEE0A4AEE0A58D20E0A5A5

They represent the string काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥. The former appears to be encoded badly but works in the application; the latter appears to be encoded correctly but does not. I need to be able to create the first hex string from the input.

Here comes the long version: I've got a legacy application built in PHP/MySQL. The database connection charset is latin1. The charset of the table is utf8 (don't ask). The input is coerced into being correct utf8 via the ForceUTF8 composer library. Looking directly in the database, the stored value of this string is काचं शकà¥à¤¨à¥‹à¤®à¥à¤¯à¤¤à¥à¤¤à¥à¤®à¥ । नोपहिनसà¥à¤¤à¤¿ मामॠ॥

I am aware that this looks horrendous and appears to me to be badly encoded; however, it is out of scope to fix the legacy application. The rest of the application copes with this data as it is, and everything else works and displays perfectly well with it.

I have created an external Node application to replace the current insert routine running on Azure. I've set the connection charset to latin1; it's connecting to the same database and running the same insert statement. The only part of the puzzle I've not been able to replicate is the ForceUTF8 library, as I could find no equivalent in the npm ecosystem. When the same string is inserted it renders perfectly when looking at the raw field in PhpStorm, i.e. it looks exactly like the original text above, and the hex value of the string is the latter of the two presented at the top of the question. However, when viewed in the application the values are corrupted, with question marks and black diamonds.

If, within the PHP application, I run SET NAMES utf8 ahead of the query that fetches the data for rendering, then the Node-inserted values render correctly, and the legacy ones now display as corrupted. Adding SET NAMES utf8 to the application for this query is not an acceptable solution, since it breaks the appearance of the legacy data, and fixing the legacy data is also not an acceptable solution.

I have tried all sorts of connection charsets and various iconv functions to make the data exactly match what the legacy app produces, but I have not been able to "break it" in exactly the same way.

How can I make "काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥" into a string, the hex value of which is "20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5" using some variation of database connection charset and string conversion?

ThisLeeNoble
  • Where is the PHP/Application code for displaying old `badly encoded data`? – जलजनक Mar 02 '22 at 17:49
  • The "horrendous" string is `काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥` encoded in UTF-8 w/ BOM, decoded with something similar to ISO-8859-1 but not exactly, and encoded in UTF-8 again, plus there is an extra space (hex 20) at the front. Using ISO-8859-1 I get `C3AFC2BBC2BFC3A0C2A4C295C3A0C2A4C2BEC3A0C2A4C29AC3A0C2A4C28220C3A0C2A4C2B6C3A0C2A4C295C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5C28BC3A0C2A4C2...` which is close but not quite right. I pity you for having to deal with this mess – Mark Tolonen Mar 02 '22 at 17:55
  • The intermediate decoding is `Windows-1252` but with the undefined bytes left in. Similar to `ISO-8859-1` but with some replacements like the Euro and smart quotes. I'm familiar with PHP but could give a Python solution that proves it anyway. – Mark Tolonen Mar 02 '22 at 18:25

2 Answers


I'm not familiar with PHP, but I was able to generate the "horrendous" encoding with Python (and it is horrendous...not sure how someone intentionally generated this crap). Hopefully this guides you to a solution:

import re

expected = '20C3AFC2BBC2BFC3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A20C3A0C2A4C2B6C3A0C2A4E280A2C3A0C2A5C28DC3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AEC3A0C2A5C28DC3A0C2A4C2AFC3A0C2A4C2A4C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A5C281C3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A420C3A0C2A4C2A8C3A0C2A5E280B9C3A0C2A4C2AAC3A0C2A4C2B9C3A0C2A4C2BFC3A0C2A4C2A8C3A0C2A4C2B8C3A0C2A5C28DC3A0C2A4C2A4C3A0C2A4C2BF20C3A0C2A4C2AEC3A0C2A4C2BEC3A0C2A4C2AEC3A0C2A5C28D20C3A0C2A5C2A5'
original = 'काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥'

# Encode in UTF-8 w/ BOM  (U+FEFF encoded in UTF-8 as a signature)
step1 = original.encode('utf-8-sig')

# Windows-1252 doesn't define some byte -> codepoint mappings and Python normally
# raises an error on those bytes.  Use an error handler to keep the bytes that
# fail, then replace the escape codes with the matching Unicode codepoint.
step2 = step1.decode('cp1252', errors='backslashreplace')
step3 = re.sub(r'\\x([0-9a-f]{2})', lambda x: chr(int(x.group(1), 16)), step2)

# There is an extra space before the UTF-8-encoded BOM for some reason
step4 = ' ' + step3

# Re-encode the mis-decoded text as UTF-8 (this is what gets stored)
step5 = step4.encode('utf8')

# Format to match expected string
final = step5.hex().upper()

print(final == expected)  # True
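
Since the replacement insert routine runs on Node, the same steps can be sketched there without extra dependencies. This is an untested translation of the Python above, not a known library routine: the `mangle` helper name and the inlined cp1252 table are mine, and it assumes only Node's built-in `Buffer`.

// cp1252's defined mappings for bytes 0x80-0x9F; the five undefined
// bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are deliberately absent so they
// pass through unchanged, mirroring the backslashreplace trick above.
const CP1252_HIGH = {
  0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
  0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
  0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
  0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
  0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
  0x9E: 0x017E, 0x9F: 0x0178,
};

function mangle(original) {
  // Encode as UTF-8 with a BOM in front (Python's 'utf-8-sig')
  const bytes = Buffer.concat([
    Buffer.from([0xEF, 0xBB, 0xBF]),
    Buffer.from(original, 'utf8'),
  ]);
  // Mis-decode each byte as cp1252, keeping undefined bytes as-is
  let misdecoded = '';
  for (const b of bytes) {
    misdecoded += String.fromCharCode(CP1252_HIGH[b] ?? b);
  }
  // Prepend the stray leading space, then re-encode as UTF-8
  return Buffer.from(' ' + misdecoded, 'utf8');
}

const original = 'काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥';
console.log(mangle(original).toString('hex').toUpperCase());
// Should print the "horrendous" hex string from the question.

If this sketch is right, `mangle(original)` reproduces the target bytes, stray leading space and all, ready to pass to the insert statement.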
Mark Tolonen
  • Thanks for trying to help. My Python is probably worse than your PHP, however the solution I'm looking for would be in node js. Looking at this it seems python has a string method that encodes and decodes to any charset. That's nice. Not sure if similar exists in node though. I've had a bit of a try with no success so far, but will keep trying tomorrow (on UK time). – ThisLeeNoble Mar 02 '22 at 21:52
  • @ThisLeeNoble Hopefully someone with experience with both languages can translate. At least the algorithm works. How this mess usually happens is someone read a UTF-8 /w BOM-encoded file but decoded it as code page 1252, then wrote that mis-decoded data to a UTF-8 database. Not sure where the extra space would come from. – Mark Tolonen Mar 03 '22 at 01:36
  • I added to my Answer to talk about some of the things mentioned in this Answer+Comments. Note: Python has configuration settings in `MySQLdb.connect`, `* -*- coding`, `encoding='utf-8-bom'`, etc. – Rick James Mar 04 '22 at 15:42
  • I think I'm getting closer to replicating this in node. I'm just not familiar with the libraries and mapping this process from the python is a struggle. I'll post the ultimate solution should I get there, not that it's going to help anybody else. I'm actually more intrigued and would find it more useful to know more about the process that was used to work out the encoding sequence. It was surely not a brute forcing of different possibilities. – ThisLeeNoble Mar 08 '22 at 08:57

HEX('काचं') = 'E0A495E0A4BEE0A49AE0A482' -- utf8mb4 to utf8mb4 hex

HEX(CONVERT(CONVERT(BINARY('काचं') USING latin1) USING utf8mb4)) = 'C3A0C2A4E280A2C3A0C2A4C2BEC3A0C2A4C5A1C3A0C2A4E2809A' -- utf8mb4, double-encoded

See "double-encoding" in Trouble with UTF-8 characters; what I see is not what I stored

More

"Double-encoding", as I understand it, is where utf8 bytes (up to 4 bytes per "character") are treated as latin1 (or cpnnnn) and converted to utf8, and then that happens a second time. In this case, each 3-byte Devanagari is converted twice, leading to between 6 and 9 bytes.

You explained the cause here:

The database connection charset is latin1. The charset of the table is utf8

BOM is, in my opinion, a red herring. It was intended to be a useful clue that a "text" file was encoded in UTF-8, but unfortunately, very few products generate it. Hence, BOM is more of a distraction than a help. (I don't think MySQL has any way to take care of BOM -- after all, most database activity is at the row level, not the file level.)

The solution (for the data flow) in a MySQL context is to rip out all "conversion" functions and, instead, configure things so that MySQL will convert at the appropriate places. Your mention of "latin1" was the main "mis-configuration".

The long expression (HEX...) gives a clue as to how to fix the data, but it must be coordinated with changes to the configuration and to the code.

Rick James
  • It's not double-encoded though; well, not as simple as double-encoded. Your final hex is `C3A0C2A4...` whereas the expected is `C3AFC2BB...` – ThisLeeNoble Mar 07 '22 at 14:58