I have a Perl CGI script that is accessing Thai language, UTF-8 strings from a PostgreSQL DB and returning them to a web-based front end as JSON. The strings are fine when I get them from the DB and after I encode them as JSON (based on writing to a log file). However, when the client receives them they are corrupted, for example:
featurename "à¹\u0082รà¸\u0087à¹\u0080รียà¸\u0099วัà¸\u0094ภาษี"
Clearly some chars are being converted to Unicode escape sequences, but not all.
I could really use some suggestions as to how to solve this.
Simplified code snippet follows. I am using 'utf8' and 'utf8::all', as well as 'JSON'.
Thanks in advance for any help you can provide.
my $dataId = $cgi->param('dataid');
my $table = "uploadpoints";
my $sqlcommand = "select id,featurename from $table where dataid=$dataId;";
my $stmt = $gDbh->prepare($sqlcommand);
my $numrows = $stmt->execute;
# print JSON header
print <<EOM;
Content-type: application/json; charset="UTF-8"
EOM
my @retarray;
for (my $i = 0; ($i < $numrows); $i=$i+1)
{
my $hashref = $stmt->fetchrow_hashref("NAME_lc");
#my $featurename = $hashref->{'featurename'};
#logentry("Point $i feature name is: $featurename\n");
push @retarray,$hashref;
}
my $json = encode_json (\@retarray);
logentry("JSON\n $json");
print $json;
I have modified and simplified the example, now running locally rather than via browser invocation:
my $dataId = 5;
my $table = "uploadpoints";
my $sqlcommand = "select id,featurename from $table where dataid=$dataId and id=75;";
my $stmt = $gDbh->prepare($sqlcommand);
my $numrows = $stmt->execute;
my @retarray;
for (my $i = 0; ($i < $numrows); $i=$i+1)
{
my $hashref = $stmt->fetchrow_hashref("NAME_lc");
my $featurename = $hashref->{'featurename'};
print "featurename $featurename\n";
push @retarray,$hashref;
}
my $json = encode_json (\@retarray);
print $json;
Using hexdump as in Stefan's example, I've determined that the data as read from the database are already in UTF-8. It looks as though they are being re-encoded in the JSON encode method. But why?
The data in the JSON use exactly twice as many bytes as the original UTF-8.
perl testcase.pl | hexdump -C
00000000 66 65 61 74 75 72 65 6e 61 6d 65 20 e0 b9 82 e0 |featurename ....|
00000010 b8 a3 e0 b8 87 e0 b9 80 e0 b8 a3 e0 b8 b5 e0 b8 |................|
00000020 a2 e0 b8 99 e0 b9 81 e0 b8 88 e0 b9 88 e0 b8 a1 |................|
00000030 e0 b8 88 e0 b8 b1 e0 b8 99 e0 b8 97 e0 b8 a3 e0 |................|
00000040 b9 8c 0a 5b 7b 22 66 65 61 74 75 72 65 6e 61 6d |...[{"featurenam|
00000050 65 22 3a 22 c3 a0 c2 b9 c2 82 c3 a0 c2 b8 c2 a3 |e":"............|
00000060 c3 a0 c2 b8 c2 87 c3 a0 c2 b9 c2 80 c3 a0 c2 b8 |................|
00000070 c2 a3 c3 a0 c2 b8 c2 b5 c3 a0 c2 b8 c2 a2 c3 a0 |................|
00000080 c2 b8 c2 99 c3 a0 c2 b9 c2 81 c3 a0 c2 b8 c2 88 |................|
00000090 c3 a0 c2 b9 c2 88 c3 a0 c2 b8 c2 a1 c3 a0 c2 b8 |................|
000000a0 c2 88 c3 a0 c2 b8 c2 b1 c3 a0 c2 b8 c2 99 c3 a0 |................|
000000b0 c2 b8 c2 97 c3 a0 c2 b8 c2 a3 c3 a0 c2 b9 c2 8c |................|
000000c0 22 2c 22 69 64 22 3a 37 35 7d 5d |","id":75}]|
000000cb
Further suggestions? I tried using decode on the UTF string but got errors related to wide characters.
I did read the recommended answer from Tom Christianson, as well as his Unicode tutorials, but I will admit much of it went over my head. Also it seems my problem is considerably more constrained.
I did wonder whether retrieving the hash value and assigning it to a normal variable was doing some sort of auto-decoding or encoding. I do not really understand when Perl uses its internal character format as opposed to when it retains the external encoding.
UPDATE WITH SOLUTION
Turns out that since the string retrieved from the DB is already in UTF-8, I need to use 'to_json' rather than 'encode_json'. This fixed the problem. Learned a lot about Perl Unicode handling in the process though...
Also recommend: http://perldoc.perl.org/perluniintro.html
Very clear exposition.