How can I convert CGI input to UTF-8 without Perl's Encode module?

Question

Through this forum, I have learned that it is not a good idea to use the following for converting CGI input (from either an escape()d Ajax call or a normal HTML form post) to UTF-8:

read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
utf8::decode $_;

A safer way (which for example does not allow bogus characters through) is to do the following:

use Encode qw (decode);
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
$_ = decode ('UTF-8', $_, Encode::FB_CROAK);  # decode() returns the decoded string, so assign it

I would, however, very much like to avoid using any modules (including XSLoader, Exporter, and whatever else they bring with them). The function is for a high-volume mod_perl driven website and I think both performance and maintainability will be better without modules (especially since the current code does not use any).

I guess one approach would be to examine the Encode module and strip out the functions and constants used for the “decode ('UTF-8', $_, Encode::FB_CROAK)” call. I am not sufficiently familiar with Unicode and Perl modules to do this. Maybe somebody else is capable of doing this, or knows a similar, safe “native” way of doing the UTF-8 conversion?

UPDATE:

I prefer keeping things non-modular, because then the only black-box is Perl's own compiler (unless of course you dig down into the module libs).

Sometimes you see large modules being replaced with a few specific lines of code. For example, instead of the CGI.pm module (which people are also in love with), one can use the following for parsing AJAX posts:

my %Input;
if ($ENV{CONTENT_LENGTH}) {
    read (STDIN, $_, $ENV{CONTENT_LENGTH});
    foreach (split (/&/)) {
        tr/+/ /; s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;
        if (m{^(\w+)=\s*(.*?)\s*$}s) { $Input{$1} = $2; }
        else { die ("bad input ($_)"); }
    }
}

In a similar way, it would be great if one could extract or replicate Encode's UTF-8 decode function.

possible duplicate of [Checklist for going the Unicode way with Perl](http://stackoverflow.com/questions/3735721/checklist-for-going-the-unicode-way-with-perl) — brian d foy, Sep 19 '10 at 14:27
Encode comes with Perl, and shouldn't "Doing it right" trump anything else? Modules are just code. — brian d foy, Sep 19 '10 at 14:28
I cover most of this stuff in _Effective Perl Programming_, which I think I mentioned to you previously. Encode is the native way to do it. Perl separates big chunks of functionality into modules so you don't have to use the stuff you don't want. — brian d foy, Sep 19 '10 at 14:30
Modules are not a black box. You can look at their source. Most people are not in love with CGI. They recommend it to people who don't know what they are doing because it's at least a starting point. Your CGI parser, for instance, is horribly broken for all the same reasons that other people who don't know what they are doing break things. For instance, & is not always the parameter separator, and parameters can have multiple values. You handle neither of those. Look at CGI.pm to see what it does and what you have to handle. — brian d foy, Sep 19 '10 at 19:38
[Black box] I did write “unless of course you dig down into the module libs”. But that takes time and you still won’t know what future updates will bring. [CGI Parser] You are right, but ironically that proves my point. I am getting CGI input from my own Ajax call, so I don’t need to look for alternative separators or multiple values. Hence, my own non-module code is faster and simpler than the W3O-compliant CGI.pm. — W3Coder, Sep 19 '10 at 19:55
Well, do what you want, but you don't seem to be having much success doing it your way. I won't waste my time answering your questions since you obviously don't care. — brian d foy, Sep 19 '10 at 23:37
I do care, that’s why I have been trying to explain my point of view. And so far, you haven’t answered my question, you have just tried to prove me wrong. — W3Coder, Sep 20 '10 at 07:32
You say: *A safer way […] is to do the following […]*. [But this is not what I said.](http://stackoverflow.com/q/3735721#3736787) I specifically recommended the `URI::Escape::XS` module instead of `unpack`. XS modules run at the same order of speed as built-ins, so your performance concerns do not carry any weight. As brian said, [profile](http://p3rl.org/Devel::NYTProf#Apache_Profiling) first, you'll be surprised where your program actually spends its time. — daxim, Sep 20 '10 at 08:12
@daxim. I never quoted you, sorry if you felt that way. It is the “decode ('UTF-8', $_, Encode::FB_CROAK)” part which is my focus here. Thanks for the profile link, though, that is useful. — W3Coder, Sep 20 '10 at 14:22

score 6 · Answer 1 · answered Sep 19 '10 at 14:33

Don't pre-optimize. Do it the conventional way first, then profile and benchmark later to see where you need to optimize. People usually waste all their time somewhere else, so starting off blindfolded and handcuffed doesn't give you any benefit.

Don't be afraid of modules. The point of mod_perl is to load up everything as few times as possible so the startup time and module loading time are insignificant.
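As a rough sketch of the preloading idea, assuming a mod_perl setup where a startup file is pulled in once by the server (for example through a `PerlRequire` directive): the file name, the path and the helper name `decode_input_strict` are hypothetical and not taken from the question.

```perl
# startup.pl (hypothetical name), loaded once at server start, e.g. via
#   PerlRequire /path/to/startup.pl
use strict;
use warnings;
use Encode ();    # compiled once here instead of on every request

# Hypothetical helper: strict UTF-8 decode that dies on malformed input.
# It works on a copy, because decode() with a CHECK value such as FB_CROAK
# may modify its source argument.
sub decode_input_strict {
    my ($bytes) = @_;
    return Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
}

1;
```

Request handlers would then call the helper without paying the module load cost again.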

brian d foy


What's wrong with pre-optimizing, when you know exactly what you are going to need? Why go through the trouble of benchmarking if you don't have to (because you have shaved off all unnecessary logic)? Of course, you have a point regarding mod_perl, and I generally acknowledge the fact that your knowledge of Perl is about 1000 times greater than mine. So I am certainly taking your point of view into consideration and look forward to hearing other people's points of view. – W3Coder Sep 19 '10 at 19:32
Well, I don't think you know exactly what you need. It doesn't sound like you know what you are doing. – brian d foy Sep 19 '10 at 19:36
Very constructive comment. If I knew what to do, would I be asking questions? You see, I believe that is the purpose of this site - not the promotion of commercial books by their authors. – W3Coder Sep 20 '10 at 07:40
The purpose of this site is to help people who genuinely want help. You don't appear to want real help. Instead, you're looking for validation for your pre-conceived ideas. I don't feel bad about promoting my books. I don't feel bad about promoting other people's books. "The more that you read, the more things you will know. The more that you learn, the more places you'll go." That's why we write books. – brian d foy Sep 22 '10 at 13:56

score 1 · Accepted Answer · answered Sep 21 '10 at 09:59

Don't use escape() to create your posted data. This isn't compatible with URL-encoding, it's a mutant JavaScript oddity which should normally never be used. One of the defects is that it will encode non-ASCII characters to non-standard %uNNNN sequences based on UTF-16 code units, instead of standard URL-encoded UTF-8. Your current code won't be able to handle that.

You should typically use encodeURIComponent() instead.
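To make the difference concrete, here is a small Perl sketch that feeds the question's `%XX` decoder the strings the two JavaScript functions produce for U+00E9 and U+20AC. The helper name `percent_decode` is hypothetical; `utf8::decode` is built into Perl and needs no module load.

```perl
use strict;
use warnings;

# The %XX substitution from the question, wrapped in a hypothetical helper.
sub percent_decode {
    my ($s) = @_;
    $s =~ s{%([a-fA-F0-9]{2})}{ pack('C', hex($1)) }eg;
    return $s;
}

# escape() of U+00E9 yields "%E9": a Latin-1 code point, not UTF-8.
my $from_escape = percent_decode('%E9');
my $escape_ok   = utf8::decode($from_escape);   # false: lone 0xE9 is malformed

# escape() of U+20AC yields "%u20AC", which the %XX pattern never matches.
my $from_escape_u = percent_decode('%u20AC');   # comes back unchanged

# encodeURIComponent() of U+00E9 yields "%C3%A9": percent-encoded UTF-8.
my $from_euc = percent_decode('%C3%A9');
my $euc_ok   = utf8::decode($from_euc);         # true: now a single character
```

Only the `encodeURIComponent()` form survives the round trip, so the decoder never has to guess which encoding the client used.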

If you must URL-decode posted input yourself rather than using a form library (and this does mean you won't be able to handle multipart/form-data), you will need to convert + symbols to spaces before replacing %-sequences. This replacement is standard in form submissions (though not elsewhere in URL-encoded data).
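A minimal sketch of that ordering, assuming a single `application/x-www-form-urlencoded` value has already been split out of the request body (the helper name `form_decode` is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: decode one form-encoded value into raw bytes.
sub form_decode {
    my ($value) = @_;
    $value =~ tr/+/ /;                               # '+' means space in form data
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # then expand %XX to bytes
    return $value;
}

my $decoded = form_decode('a+b%2Bc');   # the literal plus arrives as %2B
```

Doing the two steps in the opposite order would turn the decoded `%2B` into a space as well, which is why the `+` translation has to come first.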

To ensure input is valid UTF-8 if you really don't want to use a library, try this regex. It also excludes some control characters (you may want to tweak it to exclude more).
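The linked regex is not reproduced in this copy of the thread. As an illustration of the general technique only, the sketch below checks a byte string against the well-formed UTF-8 byte ranges and allows just tab, LF, CR and printable characters from the ASCII range. The helper name `is_clean_utf8` is hypothetical, and the pattern may differ from the one the answer linked to.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8 with no ASCII
# control characters other than tab, LF and CR; returns 0 otherwise.
sub is_clean_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m{\A(?:
          [\x09\x0A\x0D\x20-\x7E]             # ASCII minus most control chars
        | [\xC2-\xDF][\x80-\xBF]              # 2-byte, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3-byte, straight
        | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4-byte, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}           # 4-byte, planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4-byte, plane 16
    )*\z}x ? 1 : 0;
}

# Usage: validate the raw bytes first, then mark them as characters.
my $input = "caf\xC3\xA9";
die "bad input" unless is_clean_utf8($input);
utf8::decode($input);   # built in, no module load needed
```

The two-byte branch still admits the C1 control range (U+0080 to U+009F), so a stricter filter would need to tighten it.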

Your input is much appreciated. I was aware of the shortcomings of escape (), binary/multipart posting, etc., but the RegEx you are linking to seems very useful. Whether my approach to decoding UTF-8 makes sense or not, time will show, but your answer is definitely helpful, thanks very much! — W3Coder, Sep 21 '10 at 13:24
