3

I was trying to write up an example of testing query string parsing when I got stumped on a Unicode issue. In short, the letter "Omega" (Ω) doesn't seem to be decoded correctly.

  • Unicode: U+2126
  • 3-byte sequence: \xe2\x84\xa6
  • URI encoded: %E2%84%A6

So I wrote this test program to verify that I could "decode" Unicode query strings with URI::Encode.

use strict;
use warnings;
use utf8::all;    # use before Test::Builder clones STDOUT, etc.
use URI::Encode 'uri_decode';
use Test::More;

sub parse_query_string {
    my $query_string = shift;
    my @pairs = split /[&;]/ => $query_string;

    my %values_for;
    foreach my $pair (@pairs) {
        my ( $key, $value ) = split( /=/, $pair );
        $_ = uri_decode($_) for $key, $value;
        $values_for{$key} ||= [];
        push @{ $values_for{$key} } => $value;
    }
    return \%values_for;
}

my $omega = "\N{U+2126}";
my $query = parse_query_string('alpha=%E2%84%A6');
is_deeply $query, { alpha => [$omega] }, 'Unicode should decode correctly';

diag $omega;
diag $query->{alpha}[0];

done_testing;

And the output of the test:

query.t .. 
not ok 1 - Unicode should decode correctly
#   Failed test 'Unicode should decode correctly'
#   at query.t line 23.
#     Structures begin differing at:
#          $got->{alpha}[0] = 'â¦'
#     $expected->{alpha}[0] = 'Ω'
# Ω
# â¦
1..1
# Looks like you failed 1 test of 1.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 

Test Summary Report
-------------------
query.t (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=1, Tests=1,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.05 cusr  0.00 csys =  0.09 CPU)
Result: FAIL

It looks to me like URI::Encode may be broken here, but switching to URI::Escape and using the uri_unescape function reports the same error. What am I missing?

Ovid
    The `CGI` module offers the [pragma import `-utf8` to decode input automatically](http://p3rl.org/CGI#utf8). This works as intended: `perl -e'use CGI qw(-utf8); my $cgi = CGI->new("alpha=%E2%84%A6"); use Devel::Peek; Dump $cgi->param("alpha")'` Beware of the caveat mentioned in the documentation. – daxim Apr 10 '12 at 10:18

4 Answers

7

The URI-encoded characters simply represent UTF-8 byte sequences. URI::Encode and URI::Escape decode them into a UTF-8 byte string, and neither goes on to decode that byte string as UTF-8 (which is the correct behavior for a generic URI decoding library).

Put another way, your code basically does: is "\N{U+2126}", "\xe2\x84\xa6" and that fails because, upon comparison, Perl upgrades the latter to a three-character Latin-1 string.

You have to manually decode the input value with Encode::decode_utf8 after uri_decode, or else compare against the encoded UTF-8 byte sequence.
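A minimal sketch of both options, using only the core Encode module. The literal "\xe2\x84\xa6" stands in for what a percent-decoder would return for %E2%84%A6; that substitution is an assumption made so the snippet runs without URI::Encode installed.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Stand-in for the result of uri_decode('%E2%84%A6'): three octets.
my $octets = "\xe2\x84\xa6";
# What the test expects: one character, U+2126 (OHM SIGN, rendered as Ω).
my $omega = "\N{U+2126}";

printf "octets: length %d\n", length $octets;    # 3
printf "omega:  length %d\n", length $omega;     # 1

# Option 1: decode the octets into characters, then compare characters.
my $decoded = decode_utf8($octets);
print "characters match\n" if $decoded eq $omega;

# Option 2: encode the expected character, then compare octets.
print "octets match\n" if encode_utf8($omega) eq $octets;
```

Option 1 is usually the one to prefer in application code, since the rest of the program can then work with characters rather than bytes.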

miyagawa
5

URI escaping represents octets and knows nothing about character encodings, so you have to decode from UTF-8 octets to characters yourself, e.g.:

$_ = decode_utf8(uri_decode($_)) for $key, $value;
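For context, here is roughly how that line could slot into the parser from the question. This is a sketch, not a drop-in replacement: percent_decode is a hypothetical hand-rolled stand-in for uri_decode so the example depends only on core modules, and the empty-pair and missing-value handling are additions that are not in the original code.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical stand-in for uri_decode: turns each %XX into one octet.
sub percent_decode {
    my $string = shift;
    $string =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $string;
}

sub parse_query_string {
    my $query_string = shift;
    my %values_for;
    foreach my $pair ( split /[&;]/, $query_string ) {
        next unless length $pair;                     # skip empty pairs such as "a=1&&b=2"
        my ( $key, $value ) = split /=/, $pair, 2;
        $value = '' unless defined $value;            # "flag" with no "=" gets an empty value
        # Undo both layers: percent-escapes to octets, then octets to characters.
        $_ = decode_utf8( percent_decode($_) ) for $key, $value;
        push @{ $values_for{$key} }, $value;
    }
    return \%values_for;
}

my $query = parse_query_string('alpha=%E2%84%A6;beta=plain');
print "alpha decodes to U+2126\n" if $query->{alpha}[0] eq "\N{U+2126}";
print "beta is '$query->{beta}[0]'\n";
```

With URI::Encode available, percent_decode can be swapped back for uri_decode, as in the one-liner above.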
ilmari
4

The problem can be seen in incorrect details in your own explanation of the problem. What you are dealing with is really:

  • Unicode codepoint: U+2126
  • UTF-8 encoding of codepoint: \xe2\x84\xa6
  • URI encoding of UTF-8 encoding of codepoint: %E2%84%A6

The problem is that you only undid one of the encodings.
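The two layers can be made explicit with a round trip. This is an illustrative sketch using only the core Encode module; the percent-escaping and unescaping are hand-rolled here as an assumption, standing in for what URI::Encode or URI::Escape would do.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $char = "\N{U+2126}";

# Layer 1: character -> UTF-8 octets.
my $octets = encode_utf8($char);

# Layer 2: octets -> percent-escapes (hand-rolled for illustration).
my $escaped = join '', map { sprintf( '%%%02X', ord $_ ) } split //, $octets;
print "$escaped\n";    # %E2%84%A6

# Reverse layer 2: percent-escapes -> octets.
( my $unescaped = $escaped ) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Reverse layer 1: octets -> character.
my $roundtrip = decode_utf8($unescaped);
printf "U+%04X\n", ord $roundtrip;    # U+2126
```

Stopping after the "reverse layer 2" step leaves the three-octet string, which is exactly what the failing test compared against the one-character expectation.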

Solutions have already been presented. I just wanted to present an alternate explanation.

ikegami
0

I'd recommend that you have a look at Why does modern Perl avoid UTF-8 by default? for a thorough discussion on this topic.

I would add to the discussion there:

  • You'll notice a lot of odd glyphs on the page. This was intentional on the part of the author.
  • I've tried the Symbola font recommended in the thread and it looked horrible on Win 7. YMMV.
  • Reading Why does modern Perl avoid UTF-8 by default? too frequently may lead to depression and lingering doubts about your life choices.
converter42