I am seeing strange behavior in Perl when decoding a Unicode JSON string produced by a PHP script's json_encode function. I reduced the problem to the following code:

#!/usr/bin/perl
use CGI;
use JSON;
print CGI::header(-type=>'text/html', -charset=>'UTF-8');

print %{ decode_json('{"test_1" : "= \u00F9 ="}') }->{'test_1'};
print '<br>';
print %{ decode_json('{"test_2" : "= \u00F9 \u0121 ="}') }->{'test_2'};

When I run this script in a browser, I see the following:

= � =
= ù ġ =

The first line contains a "broken character"; the second is correct. What I think is happening is that, for some reason, Perl decodes the first string in ISO-8859-1 encoding: if I change the page encoding to ISO-8859-1, the first line is correct, but the second is broken.

My Perl version is 5.10.1 and the JSON version is 2.51.

Question: how do I force Perl's decode_json to return UTF-8 characters in the first print?

Note: I can fix the problem by manually converting the first output to UTF-8, but this requires installing an additional "Encoder" module, which I want to avoid.

daxim
braz

1 Answer

I tried your code, and it generated several warnings with "use warnings;" enabled.

If you want to be sure to get UTF-8 output, I believe you have to tell Perl so explicitly. Use "binmode(STDOUT, ":utf8");" or similar.

This works on the command-line:

use strict;
use warnings;
use JSON;

binmode(STDOUT, ":utf8");

print decode_json('{"test_1" : "= \u00F9 ="}')->{test_1};
print '<br>';
print decode_json('{"test_2" : "= \u00F9 \u0121 ="}')->{'test_2'};

EDIT: AFAIK, this does not affect decode_json() itself, only the output from the Perl script. Unicode tutorials often tell you to explicitly state which encoding you want on your input and output filehandles.
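To illustrate the point in the EDIT, here is a minimal sketch that separates the two steps: decode_json() yields a string of characters, and the output layer decides which bytes get written. It prints into an in-memory buffer (my own choice, just so the bytes can be inspected) through the same ":utf8" layer as above; the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use JSON;

# decode_json() returns a string of characters; no output encoding is involved yet.
my $text = decode_json('{"test_1" : "= \u00F9 ="}')->{test_1};
my $char_count = length $text;                                # counts characters, not bytes
my $third_char = sprintf 'U+%04X', ord substr($text, 2, 1);   # code point of the accented letter

# Print through the :utf8 layer into an in-memory buffer so the bytes can be inspected.
my $buffer = '';
open my $out, '>:utf8', \$buffer or die "Cannot open in-memory handle: $!";
print {$out} $text;
close $out;

# Hex dump of the bytes the layer produced; U+00F9 should appear as C3 B9.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;

print "characters: $char_count, third character: $third_char\n";
print "bytes written: $hex\n";
```

The decoded string holds the single character U+00F9 either way; only the layer on the filehandle turns it into the two-byte UTF-8 sequence.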

Øyvind Skaar
  • However, it is strange that Perl can't decode a "\u..." character into UTF-8 by default – braz Apr 05 '11 at 11:48
  • No, it's not that. Read http://joelonsoftware.com/articles/Unicode.html , then http://perldoc.perl.org/perlunitut.html and then take a look at http://perldoc.perl.org/perlunifaq.html – Øyvind Skaar Apr 06 '11 at 08:37
  • from the faq: "The Perl warning "Wide character in ..." is caused by a character with an ordinal value greater than 255. With no specified encoding layer, Perl tries to fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it emits this warning (if warnings are enabled), and outputs UTF-8 encoded data instead." – Øyvind Skaar Apr 06 '11 at 08:40
  • I've read the first article, it is very nice. Just to make sure that I understood everything correctly, is my explanation below correct: – braz Apr 06 '11 at 10:20
  • In the first string of my example, Perl sees a character that can be converted to ISO-8859-1, so it does so, and because my page encoding is UTF-8 the character looks broken. When Perl meets the second string, it sees the second character \u0121, which can't be converted to ISO-8859-1, so Perl emits a warning and converts the whole string to UTF-8? – braz Apr 06 '11 at 10:34
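
The behavior described in the FAQ quote can be checked with a small sketch. It prints both test strings to a handle with no encoding layer and records the bytes written and the warnings raised. The helper print_without_layer is a hypothetical name invented for this illustration, and the in-memory handle stands in for STDOUT, on the assumption that both take the same print path.

```perl
use strict;
use warnings;
use JSON;

# Hypothetical helper: print a string to an in-memory handle that has no
# encoding layer, then return a hex dump of the bytes and the warning count.
sub print_without_layer {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };   # capture instead of printing
    my $buffer = '';
    open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";
    print {$out} $string;
    close $out;
    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
    return ($hex, scalar @warnings);
}

my $data = decode_json('{"test_1" : "= \u00F9 =", "test_2" : "= \u00F9 \u0121 ="}');

# test_1 fits in ISO-8859-1, so U+00F9 is written as the single byte F9.
my ($hex_1, $warnings_1) = print_without_layer($data->{test_1});

# test_2 contains U+0121 (> 255), so Perl warns and writes UTF-8 for the whole string.
my ($hex_2, $warnings_2) = print_without_layer($data->{test_2});

print "test_1: $hex_1 (warnings: $warnings_1)\n";
print "test_2: $hex_2 (warnings: $warnings_2)\n";
```

In the first dump U+00F9 shows up as F9, which a UTF-8 page renders as a broken character; in the second it shows up as C3 B9 alongside a "Wide character" warning, which matches the output in the question.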