Perl-generated UTF8 strings in JSON corrupted on client side

Question

I have a Perl CGI script that is accessing Thai language, UTF-8 strings from a PostgreSQL DB and returning them to a web-based front end as JSON. The strings are fine when I get them from the DB and after I encode them as JSON (based on writing to a log file). However, when the client receives them they are corrupted, for example:

featurename "à¹\u0082à¸£à¸\u0087à¹\u0080à¸£à¸µà¸¢à¸\u0099à¸§à¸±à¸\u0094à¸ à¸²à¸©à¸µ"

Clearly some chars are being converted to Unicode escape sequences, but not all.

I could really use some suggestions as to how to solve this.

Simplified code snippet follows. I am using 'utf8' and 'utf8::all', as well as 'JSON'.

Thanks in advance for any help you can provide.

my $dataId = $cgi->param('dataid');
my $table = "uploadpoints";
my $sqlcommand = "select id,featurename from $table where dataid=$dataId;";
my $stmt = $gDbh->prepare($sqlcommand);
my $numrows = $stmt->execute;
# print JSON header
print <<EOM;
Content-type: application/json; charset="UTF-8"


EOM
my @retarray;
for (my $i = 0; ($i < $numrows); $i=$i+1)
{
    my $hashref = $stmt->fetchrow_hashref("NAME_lc");
    #my $featurename = $hashref->{'featurename'};
    #logentry("Point $i feature name is: $featurename\n");
    push @retarray,$hashref;
}
my $json = encode_json (\@retarray);
logentry("JSON\n $json");
print $json;

I have modified and simplified the example, now running locally rather than via browser invocation:

my $dataId = 5; 
my $table = "uploadpoints";
my $sqlcommand = "select id,featurename from $table where dataid=$dataId and id=75;";
my $stmt = $gDbh->prepare($sqlcommand);
my $numrows = $stmt->execute;
my @retarray;
for (my $i = 0; ($i < $numrows); $i=$i+1)
{
    my $hashref = $stmt->fetchrow_hashref("NAME_lc");
    my $featurename = $hashref->{'featurename'};
    print "featurename $featurename\n";
    push @retarray,$hashref;
}
my $json = encode_json (\@retarray);
print $json;

Using hexdump as in Stefan's example, I've determined that the data as read from the database are already in UTF-8. It looks as though they are being re-encoded in the JSON encode method. But why?

The data in the JSON use exactly twice as many bytes as the original UTF-8.

 perl testcase.pl | hexdump -C
00000000  66 65 61 74 75 72 65 6e  61 6d 65 20 e0 b9 82 e0  |featurename ....|
00000010  b8 a3 e0 b8 87 e0 b9 80  e0 b8 a3 e0 b8 b5 e0 b8  |................|
00000020  a2 e0 b8 99 e0 b9 81 e0  b8 88 e0 b9 88 e0 b8 a1  |................|
00000030  e0 b8 88 e0 b8 b1 e0 b8  99 e0 b8 97 e0 b8 a3 e0  |................|
00000040  b9 8c 0a 5b 7b 22 66 65  61 74 75 72 65 6e 61 6d  |...[{"featurenam|
00000050  65 22 3a 22 c3 a0 c2 b9  c2 82 c3 a0 c2 b8 c2 a3  |e":"............|
00000060  c3 a0 c2 b8 c2 87 c3 a0  c2 b9 c2 80 c3 a0 c2 b8  |................|
00000070  c2 a3 c3 a0 c2 b8 c2 b5  c3 a0 c2 b8 c2 a2 c3 a0  |................|
00000080  c2 b8 c2 99 c3 a0 c2 b9  c2 81 c3 a0 c2 b8 c2 88  |................|
00000090  c3 a0 c2 b9 c2 88 c3 a0  c2 b8 c2 a1 c3 a0 c2 b8  |................|
000000a0  c2 88 c3 a0 c2 b8 c2 b1  c3 a0 c2 b8 c2 99 c3 a0  |................|
000000b0  c2 b8 c2 97 c3 a0 c2 b8  c2 a3 c3 a0 c2 b9 c2 8c  |................|
000000c0  22 2c 22 69 64 22 3a 37  35 7d 5d                 |","id":75}]|
000000cb

Further suggestions? I tried using decode on the UTF string but got errors related to wide characters.

I did read the recommended answer from Tom Christianson, as well as his Unicode tutorials, but I will admit much of it went over my head. Also it seems my problem is considerably more constrained.

I did wonder whether retrieving the hash value and assigning it to a normal variable was doing some sort of auto-decoding or encoding. I do not really understand when Perl uses its internal character format as opposed to when it retains the external encoding.

UPDATE WITH SOLUTION

Turns out that since the string retrieved from the DB is already in UTF-8, I need to use 'to_json' rather than 'encode_json'. This fixed the problem. Learned a lot about Perl Unicode handling in the process though...

Also recommend: http://perldoc.perl.org/perluniintro.html

Very clear exposition.

Oh wait in what format do you get the UTF-8 strings in the DB query? UTF-8 encoded? Then you probably need to `utf8::decode($hashref->{featurename})` first! — Stefan Becker, Feb 15 '19 at 08:21
I retracted my answer. It is obviously impossible to write test code that show the correct behavior without the **exact** input string. I.e. you question is missing the exact contents of `$hashref->{featurename}`. Is it an octet string with encoded UTF-8? Is it a string in the internal Perl representation. i.e. UTF-8 decoded? Note that this can change depending if an operation is executed under `use utf8;`or `no utf8;`. — Stefan Becker, Feb 15 '19 at 08:52
Hello Stefan... How can I tell the "exact input string" format, please? (All operations are executed under 'use utf8;') When I print the string to a file then view in a terminal with UTF8 encoding set, the Thai characters render correctly. This is true both when I display the info after retrieving from the DB and also after the encode_json operation. — Sally Goldin, Feb 15 '19 at 09:30
Please see the updated answer I have posted for some example code. For your code it will be important to know if a scalar string value returned by DBI `fetchrow*` is a raw octet string (-> needs decoding) or a Perl string (-> already decoded). — Stefan Becker, Feb 15 '19 at 10:02
Just stumbled over [Tom Christiansen's answer](https://stackoverflow.com/a/6163129/8866606). Probably recommended primer if you do anything with Unicode/UTF-8 on Perl. — Stefan Becker, Feb 15 '19 at 16:32
Did you configure your db connection ? `SET client_encoding TO 'UTF-8'` — clamp, Feb 15 '19 at 19:33
Depending on your DBD::Pg version you also might need to set `$dh->{pg_enable_utf8}=1` — clamp, Feb 15 '19 at 19:39

Stefan Becker · Answer 1 · 2019-02-15T16:30:54.933

NOTE: you should probably also read this answer, which makes my answer sub-par in comparison :-)

The problem is that you have to be sure in which format each string is, otherwise you'll get incorrect conversions. When handling UTF-8 a string can be in two formats:

raw UTF-8 encoded octet string, i.e. \x{100} represented as two octets 0xC4 0x80
internal Perl string representation, i.e. one Unicode character \x{100} (U+0100 Ā LATIN CAPITAL LETTER A WITH MACRON)

If I/O is involved you also need to know if the I/O layer does UTF-8 de/encoding or not. For terminal I/O you also have to consider if it understands UTF-8 or not. Both taken together can make it difficult to get meaningful debug printouts from your code.

If you Perl code needs to process UTF-8 strings after reading them from the source, you must make sure that they are in internal Perl format. Otherwise you'll get surprising result when you call code that expects Perl strings and not raw octet strings.

I try to show this in my example code:

#!/usr/bin/perl
use warnings;
use strict;

use JSON;

open(my $utf8_stdout, '>& :encoding(UTF-8)', \*STDOUT)
    or die "can't reopen STDOUT as utf-8 file handle: $!\n";

my $hex = "C480";
print "${hex}\n";

my $raw = pack('H*', $hex);
print STDOUT       "${raw}\n";
print $utf8_stdout "${raw}\n";

my $decoded;
utf8::decode($decoded = $raw);
print STDOUT       ord($decoded), "\n";
print STDOUT       "${decoded}\n"; # Wide character in print at...
print $utf8_stdout "${decoded}\n";

my $json = JSON->new->encode([$decoded]);
print STDOUT       "${json}\n"; # Wide character in print at...
print $utf8_stdout "${json}\n";

$json = JSON->new->utf8->encode([$decoded]);
print STDOUT       "${json}\n";
print $utf8_stdout "${json}\n";

exit 0;

Copy & paste from my terminal (which supports UTF-8). Look closely at the differences between the lines:

$ perl dummy.pl
C480
Ā
Ä
256
Wide character in print at dummy.pl line 21.
Ā
Ā
Wide character in print at dummy.pl line 25.
["Ā"]
["Ā"]
["Ā"]
["Ä"]

But compare this to the following, where STDOUT is not a terminal, but piped to another program. The hex dump always shows "c4 80", i.e. UTF-8 encoded.

$ perl dummy.pl | hexdump -C
Wide character in print at dummy.pl line 21.
Wide character in print at dummy.pl line 22.
Wide character in print at dummy.pl line 25.
Wide character in print at dummy.pl line 26.
00000000  43 34 38 30 0a c4 80 0a  c4 80 0a 5b 22 c4 80 22  |C480.......[".."|
00000010  5d 0a 5b 22 c4 80 22 5d  0a 43 34 38 30 0a c4 80  |].[".."].C480...|
00000020  0a 32 35 36 0a c4 80 0a  5b 22 c4 80 22 5d 0a 5b  |.256....[".."].[|
00000030  22 c4 80 22 5d 0a                                 |".."].|
00000036

Perl-generated UTF8 strings in JSON corrupted on client side

1 Answers1