1

I wondered why some German umlauts were scrambled on our page. Then I found out that the recent version of the JSON module (I use 2.07) converts strings differently than JSON 1.5.

The problem is that I have a hash with strings like

use Data::Dumper;
use JSON;

my $test = {
  'fields' => 'überrascht'
};

print Dumper(to_json($test)); gives me

$VAR1 = "{ \"fields\" : \"\x{fc}berrascht\" } "; 

Using the old module with

$json = JSON->new();
print Dumper ($json->to_json($test));

gives me (the correct result)

$VAR1 = '{"fields":[{"title":"überrascht"}]}'; 

So umlauts are scrambled using the new JSON 2 module.

What do I need to do to get them correct?

Update: It might be bad to use Data::Dumper to show the output, because Dumper uses its own encoding. Still, a difference in the Dumper results shows that something is being treated differently here. It might be better to describe the backend, as Brad suggested: the JSON string gets printed using Template-Toolkit and then gets assigned to a JavaScript variable for further use. The correct JavaScript shows something like this

{
    "title" : "Geändert",
},

while using the new module I get

{
    "title" : "Geändert",
},

The target page is served as ISO-8859-1 (Latin-1). Any suggestions?

Thariama
  • What's the encoding of the file and what encoding does your terminal expect? It's hard to tell what used to get produced. – ikegami May 17 '13 at 18:05
  • If you can still run using the old version of the module, what do you get if you add `$Data::Dumper::Useqq = 1;`? – ikegami May 17 '13 at 18:10
  • Which backend are your examples using? `print $_,"\n" for grep { m(JSON/) } keys %INC` – Brad Gilbert May 18 '13 at 15:24
  • @ikegami: I get \374 instead of the correct "ü" I got before when I use the old version of the module (on an identical second system) – Thariama May 21 '13 at 09:39
  • @BradGilbert: The JSON string gets printed using Template-Toolkit and then gets assigned to a JavaScript variable for further use – Thariama May 21 '13 at 09:40
  • @ikegami: I get \377 instead of the correct 'ü' using the old version of the module – Thariama May 21 '13 at 09:49
  • Are you saying that DD with `Useqq=1` gives `"...\374..."`? That means the string contains `ü` encoded using iso-8859-1, which is exactly what you want. – ikegami May 21 '13 at 17:55

4 Answers

5

\x{fc} is ü, at least in Latin-1, Latin-9, etc. Also, ü is code point U+00FC in Unicode. However, we want UTF-8 (I suppose). The easiest way to get UTF-8 string literals is to save your Perl source code in this encoding and put a use utf8; at the top of your script.

Then, encoding the string as JSON yields correct output:

use strict; use warnings; use utf8;
use Data::Dumper; use JSON;
print Dumper encode_json {fields => "nicht überrascht"};

The encode_json assumes UTF-8. Read the documentation for more info.

Output:

$VAR1 = '{"fields":"nicht überrascht"}';

(JSON module version: 2.53)
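If you'd rather not rely on the source encoding, a small variation on the above (just a sketch; the umlaut is spelled with an escape instead of a literal character) gives the same result:

use strict; use warnings;
use JSON;

# Same as above, but write the umlaut as an escape so the encoding of
# this source file does not matter.
print encode_json({ fields => "nicht \x{fc}berrascht" });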

amon
  • @ikegami I realized that, thus the “save your Perl source code with this encoding [UTF-8]”. Using UTF8 source code is just one easy way to solve the issue, there are also other ways to do it. – amon May 17 '13 at 20:56
  • ack, overlooked that part of the sentence. – ikegami May 17 '13 at 20:58
  • wow, just read the rest of your post, and it's very misleading. `use utf8;` doesn't produce UTF-8-encoded strings (and it definitely doesn't produce string literals of any kind). It actually does the opposite (decode from UTF-8). "UTF-8 string literals" should be "upgraded strings". – ikegami May 17 '13 at 21:06
  • Secondly, `encode_json` doesn't "assume UTF-8", whatever that means. Maybe you meant it expects strings to contain text (Unicode code points), but it doesn't care whether they're upgraded or not. Maybe you meant it "produces UTF-8"? – ikegami May 17 '13 at 21:07
  • The source code is in cp1252, but this has not been a problem before; the data is coming from the database – Thariama May 21 '13 at 10:33
5
my $json_text = to_json($data);

is short for

my $json_text = JSON->new->encode($data);

This returns a string of Unicode code points. U+00FC is indeed the correct Unicode code point for "ü", so the output is correct. (As proof, the HTML source for that character is actually "&#xFC;".)
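A quick way to see that this really is a character string (a sketch; the data is made up):

use strict; use warnings;
use JSON;
use Encode qw( encode_utf8 );

my $json_text = to_json({ fields => "\x{fc}berrascht" });
print length($json_text), "\n";               # "ü" counts as one character here...
print length(encode_utf8($json_text)), "\n";  # ...but as two bytes once encoded as UTF-8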

It's hard to tell what your original output actually contained (since you showed non-ASCII characters directly), so it's hard to determine what your problem actually is.

But one thing you must do before outputting the string is to convert it from a string of code points into bytes, say, by using Encode's encode or encode_utf8.

use Encode qw( encode encode_utf8 );

my $json_cp1252 = encode('cp1252', to_json($data));

my $json_utf8 = encode_utf8(to_json($data));

If the appropriate encoding is UTF-8, you can also use any of the following:

my $json_utf8 = to_json($data, { utf8 => 1 });

my $json_utf8 = encode_json($data);

my $json_utf8 = JSON->new->utf8->encode($data);
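Putting it together for a Latin-1 page (a sketch, assuming the page really is served as ISO-8859-1):

use strict; use warnings;
use JSON;
use Encode qw( encode );

my $data = { fields => "\x{fc}berrascht" };

# to_json returns a character string; encode it for the Latin-1 page
# just before sending it out.
print encode('iso-8859-1', to_json($data));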
ikegami
2

Use encode_json instead. According to the manual it converts the given Perl data structure to a UTF-8 encoded, binary string.

Regarding your update: If you actually want to produce JSON in Latin1 (ISO-8859-1), you can try:

to_json($test, { latin1 => 1 })

Or

JSON->new->latin1->encode($test)

Note that if you dump the result, getting \x{fc} for ü is correct in this case. I guess that the root of your problem is that you receive text in Perl's UTF-8 format from somewhere. In this case, the latin1 option of the JSON module is needed.

You can also try to use ascii instead of latin1 as the safest option.

Another solution might be to specify an output encoding for Template-Toolkit. I don't know if that's possible. Or, you could encode your result as Latin1 in the final step before sending it to the client.
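To see the difference between the two options (a quick sketch; the hash is made up):

use strict; use warnings;
use JSON;

my $test = { fields => "\x{fc}berrascht" };

# latin1: the output is Latin-1 bytes; anything outside Latin-1 is escaped.
print JSON->new->latin1->encode($test), "\n";
# ascii: every non-ASCII character becomes a \uXXXX escape, so the result
# survives any page encoding.
print JSON->new->ascii->encode($test), "\n";   # {"fields":"\u00fcberrascht"}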

nwellnhof
  • Works for me. Make sure that your source file is either encoded in Latin1, or add `use utf8` if it's encoded in UTF-8. – nwellnhof May 17 '13 at 16:41
  • I tried to encode the string in latin1, but the result did not change (JSON->new->latin1->encode($str)) – Thariama May 21 '13 at 12:49
  • +1 there was another JSON encoding of the data that I missed - works like a charm now - thx – Thariama May 21 '13 at 14:09
  • Concerning JSON: can you tell me why I still get this warning in the log file: Prototype mismatch: sub ModPerl::ROOT::ModPerl::PerlRun::mypath_myfile_2epl::from_json: none vs ($@) at mypath_myfile.pl line 6. – Thariama May 21 '13 at 14:11
  • See [this question](http://stackoverflow.com/questions/15770114/prototype-mismatch-error-perl). – nwellnhof May 21 '13 at 14:49
  • I do not have any package declaration at all in that script – Thariama May 22 '13 at 09:26
2

Strictly speaking, Latin-1-encoded JSON is not valid JSON. The JSON spec allows UTF-8, UTF-16 or UTF-32 encodings.

If you want to be standards-compliant, or you want to ensure your JSON will be compatible with both your current pages and future UTF-8-based pages, you need to use JSON->new->utf8->encode($str). Being strict about generating valid JSON could save you lots of headaches in the future.

You can translate UTF-8 JSON to Latin-1 using client-side JavaScript if you need to, using this trick.

The ascii option also produces valid JSON, by escaping any non-ASCII characters using valid JSON Unicode escapes. But the latin1 option does not, and therefore should be avoided IMHO.

The utf8(0) option should be avoided too unless you specify an encoding when writing the data out to clients: utf8(0) is subtly different from the utf8 option in that it generates Perl character strings instead of byte strings. If you do any I/O using character strings without specifying an encoding, Perl will translate them on the fly back to Latin-1. The utf8 option generates raw UTF-8 bytes, which are perfect for doing raw I/O.
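To illustrate the utf8 vs. utf8(0) difference (a rough sketch; the hash is made up):

use strict; use warnings;
use JSON;

my $data = { title => "Ge\x{e4}ndert" };

my $bytes = JSON->new->utf8->encode($data);      # raw UTF-8 bytes, ready for raw I/O
my $chars = JSON->new->utf8(0)->encode($data);   # a Perl character string

print length($bytes), "\n";   # "ä" counts as two bytes here...
print length($chars), "\n";   # ...but as one character here

# The character string still needs an explicit encoding layer on output, e.g.
# binmode STDOUT, ':encoding(UTF-8)';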

simonp