Perl issue when encoding mysql data from UTF-8 to UCS-2 for SMPP

Question

I am trying to fetch UTF-8 accented characters such as "é" and "ê" from MySQL and convert them to UCS-2 when sending over SMPP. The data is stored as utf8_general_ci and I perform the following when opening the DB connection:

$dbh->{'mysql_enable_utf8'}=1;
$dbh->do("set NAMES 'utf8'");

If I test the sending part by hard-coding a string containing "é" and "ê" and using data_coding=8, it goes through perfectly. However, if I comment out the first line below and just use what comes from the DB, it fails. If I send the value from the DB with data_coding=3, it also goes through, but then the "ê" does not appear, which is expected. Here is what I use:

$fred = 'éêcole';  # <-- If I comment out this line, the SMPP call fails
$fred = decode('utf-8', $fred);
$fred = encode('UCS-2', $fred);

$resp_pdu = $short_smpp->submit_sm(
        source_addr_ton => 0x00,
        source_addr_npi => 0x01,
        source_addr => $didnb,
        dest_addr_ton => 0x01,
        dest_addr_npi => 0x01,
        destination_addr => $number,
        data_coding => 0x08,
        short_message => $fred
) or do {
        Log("ERROR: submit_sm indicated error: " . $resp_pdu->explain_status());
        $success = 0;
};

The values of the data_coding field are the following (from "Meaning of "data_coding" field in SMPP"):

00000000 (0) - usually GSM7
00000011 (3) for standard ISO-8859-1
00001000 (8) for the universal character set -- de facto UTF-16

The SMPP provider's documentation also mentions that special characters should be handled via UCS-2: https://community.sinch.com/t5/SMS-365-enterprise-service/Handling-Special-Characters/ta-p/1137

How should I prepare the data that is coming out of the DB to make this SMPP call work?

I am using Perl v5.10.1

Thanks !

The `decode('utf-8', $fred)` looks suspicious to me. Isn't the point of `$dbh->{'mysql_enable_utf8'}=1;` to decode the returned values? If so, the fix is to remove `$fred = decode('utf-8', $fred);`. And if so, your hard-coded version works because the script is encoded using UTF-8, but by not using `use utf8;` you implicitly told Perl it was encoded using ASCII, so the literal reaches `decode` as UTF-8 bytes. – ikegami Dec 09 '21 at 21:02
Please provide `sprintf "%vX", $s` for a working value and a failing value (from before the decode/encode). If I'm right in the previous comment, you will see code points (E9 for é) when it fails, and a string encoded using UTF-8 (C3 A9 for é IIRC) when it succeeds. – ikegami Dec 09 '21 at 21:03
When I use the stored value and print the sprintf just before the decode/encode, this is what I get: E9.EA.63.6F.6C.65, so you are right. If I then remove the decode utf-8 part and leave only the encode ucs-2, I get "é cole"; the ê is missing. If I add use utf8; at the top, I get the same result. – Questionz Dec 09 '21 at 21:26
So if I add use utf8; the output before the encode ucs-2 is E9.EA.63.6F.6C.65 and once the encode is done, the output becomes 0.E9.0.EA.0.63.0.6F.0.6C.0.65 – Questionz Dec 09 '21 at 21:31
If I leave use utf8; at the top and decode utf-8, the output right after the decode is FFFD.FFFD.63.6F.6C.65 and once encoded to ucs-2 the output is FF.FD.FF.FD.0.63.0.6F.0.6C.0.65 – Questionz Dec 09 '21 at 21:37
The messages don't fail anymore, although they don't come out right. I believe the failures were due to the messages being too long once encoded. They now submit correctly when using only 'éêcole', but the issue remains: the accents are missing. – Questionz Dec 09 '21 at 21:47
`0.E9.0.EA.0.63.0.6F.0.6C.0.65` is correct UCS-2be for `éêcole` – ikegami Dec 09 '21 at 21:55
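
The "too long once encoded" failure mentioned in the comments follows from UCS-2 using two octets per character. A minimal sketch of a pre-submit length check, assuming a limit of 140 octets (the payload of one unsegmented SMS); the `ucs2_payload` helper and the `$MAX_OCTETS` value are illustrative and not part of the original code, and the provider's actual limit may differ:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Assumption: 140 octets is the payload of one unsegmented SMS (70 UCS-2 characters).
my $MAX_OCTETS = 140;

# Hypothetical helper: returns the UCS-2 octets, or undef when they would not fit.
sub ucs2_payload {
    my ($text) = @_;                      # decoded text (Unicode code points)
    my $octets = encode('UCS-2', $text);  # two octets per character
    return length($octets) <= $MAX_OCTETS ? $octets : undef;
}

my $ok       = ucs2_payload("\xE9\xEA\x63\x6F\x6C\x65");  # 6 characters, 12 octets
my $too_long = ucs2_payload("x" x 71);                    # 71 characters, 142 octets

print defined($ok)       ? "fits\n" : "too long\n";
print defined($too_long) ? "fits\n" : "too long\n";
```

The check runs on the encoded octets, not on the character count, because that is what travels in short_message.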

ikegami · Accepted Answer · 2021-12-09T22:10:40.380

score 2

$dbh->{'mysql_enable_utf8'} = 1; is used to decode the values returned from the database, causing queries to return decoded text (strings of Unicode Code Points). It makes no sense to decode such a string. Go straight to the encode.

use Encode qw( encode );

my $s_ucp = "\xE9\xEA\x63\x6F\x6C\x65";  # éêcole
# -or-
# use utf8;                              # Script is encoded using UTF-8.
# my $s_ucp = "éêcole";

printf "%vX\n", $s_ucp;                  # E9.EA.63.6F.6C.65

my $s_ucs2be = encode('UCS-2', $s_ucp);  # Encode's 'UCS-2' is big-endian (UCS-2BE).

printf "%vX\n", $s_ucs2be;               # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
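
To make the difference concrete, here is a small sketch contrasting the two inputs seen in this thread: decoded text (what the DB returns with mysql_enable_utf8 on) and raw UTF-8 bytes (a literal in a script without `use utf8;`). The variable names are illustrative only. It also reproduces the FFFD output reported in the comments when already-decoded text is decoded a second time.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Decoded text, as returned by DBD::mysql when mysql_enable_utf8 is on.
my $text  = "\xE9\xEA\x63\x6F\x6C\x65";            # éêcole as code points
# The same string as raw UTF-8 bytes, e.g. a literal in a script without "use utf8;".
my $bytes = "\xC3\xA9\xC3\xAA\x63\x6F\x6C\x65";

# Text goes straight to encode; bytes need exactly one decode first.
my $from_text  = encode('UCS-2', $text);
my $from_bytes = encode('UCS-2', decode('UTF-8', $bytes));

# Decoding already-decoded text: E9 and EA are not valid UTF-8 here,
# so each is replaced with U+FFFD.
my $twice = decode('UTF-8', $text);

printf "%vX\n", $from_text;    # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
printf "%vX\n", $from_bytes;   # 0.E9.0.EA.0.63.0.6F.0.6C.0.65
printf "%vX\n", $twice;        # FFFD.FFFD.63.6F.6C.65
```

Both routes produce the same UCS-2 octets; only the number of decode steps differs, and it must match whether the input is text or bytes.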

edited Dec 09 '21 at 22:10

answered Dec 09 '21 at 22:01

ikegami


OK, I apparently don't really understand how they all interact, thanks for the explanations. So if I comment out $dbh->{'mysql_enable_utf8'}=1; and $dbh->do("set NAMES 'utf8'"); IT NOW WORKS !! If I could I would send you 12 beers :) I just spent two days on this. Thank you so much for taking the time to do this. – Questionz Dec 09 '21 at 22:32
I really thought that $dbh->{'mysql_enable_utf8'} = 1 would somehow make sure that the UTF-8 data is assigned as UTF-8 in the Perl variable, like explicitly telling Perl which encoding the data is in. – Questionz Dec 09 '21 at 22:43
Re "*would somehow make sure that the utf-8 data is to be assigned as utf-8 in the perl variable*", `set NAMES 'utf8'` sets how it's encoded over the wire. `$dbh->{'mysql_enable_utf8'}=1;` *decodes* from UTF-8. `mysql_enable_utf8 => 1` passed to `connect` does both. – ikegami Dec 10 '21 at 00:50
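
A minimal sketch of the connect-time form described in the comment above, assuming DBI and DBD::mysql are installed. The `connect_utf8` helper and its arguments are hypothetical placeholders, not code from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: $dsn, $user and $pass are placeholders supplied by the caller.
sub connect_utf8 {
    my ($dsn, $user, $pass) = @_;
    require DBI;   # loaded lazily so the sketch compiles without a DB driver
    return DBI->connect($dsn, $user, $pass, {
        RaiseError        => 1,
        # Passed at connect time, this both sets the wire encoding to UTF-8
        # and makes fetched strings arrive already decoded.
        mysql_enable_utf8 => 1,
    });
}
```

With a handle opened this way, fetched values are decoded text and go straight to `encode('UCS-2', ...)` with no `decode` step.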

score 0 · Answer 2 · answered Dec 10 '21 at 21:07

SET NAMES declares the encoding the client sends and expects to receive. That is, regardless of the encoding in the table, MySQL will convert the data to whatever SET NAMES specifies during a SELECT.

So, feed what comes from the SELECT directly to SMPP. (It won't be readable by most other clients.)

SET NAMES ucs2

(The collation is irrelevant to the encoding.)

You could ask the SELECT to convert with something like

CONVERT(col_name, CHAR UNICODE)

https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html
