I have been busting my noggin trying to figure out how to handle some special common characters that are input by users in forms. Examples of what I mean are the copyright sign, registered sign, fraction 1/2, fraction 1/4, etc. So here is what happens:

Users enter these characters, and they are saved into a regular text file. No problem. They are saved in their true and pure form. Now when we grab them with the Perl CGI file, and they are displayed in a browser, I get all these "A"s and other A-characters with markings above them. I am running a subroutine on the string to try to convert these from Unicode matches into HTML entities, but it doesn't seem to be working.

Perl Code:

# String with special characters
$special_chars = encodebc($special_chars);

sub encodebc {
    my $answer = $_[0];
    $answer =~ s/:://gi;
    # Escape & first so the entities added below are not escaped a second time
    $answer =~ s/\x{0026}/&amp;/g;
    $answer =~ s/\x{0022}/&quot;/g;
    $answer =~ s/\x{0027}/&#39;/g;
    $answer =~ s/\x{003C}/&lt;/g;
    $answer =~ s/\x{003E}/&gt;/g;
    $answer =~ s/\x{0060}/&#96;/g;
    $answer =~ s/\x{007B}/&#123;/g;
    $answer =~ s/\x{007D}/&#125;/g;
    $answer =~ s/\x{00A9}/&copy;/g;
    $answer =~ s/\x{00AE}/&reg;/g;
    $answer =~ s/\x{00AB}/&laquo;/g;
    $answer =~ s/\x{00BB}/&raquo;/g;
    $answer =~ s/\x{00A2}/&cent;/g;
    $answer =~ s/\x{00B0}/&deg;/g;
    $answer =~ s/\x{00B2}/&sup2;/g;
    $answer =~ s/\x{00B3}/&sup3;/g;
    $answer =~ s/\x{00B5}/&micro;/g;
    $answer =~ s/\x{00BC}/&frac14;/g;
    $answer =~ s/\x{00BD}/&frac12;/g;
    $answer =~ s/\x{00BE}/&frac34;/g;
    $answer =~ s/\x{00E1}/&aacute;/g;
    $answer =~ s/\x{00E9}/&eacute;/g;
    $answer =~ s/\x{00F1}/&ntilde;/g;
    $answer =~ s/\x{00F5}/&otilde;/g;
    $answer =~ s/\x{00F8}/&oslash;/g;
    return $answer;
}

In the above code, I'm matching for two-byte characters in Unicode...so I'm not understanding where the "A" characters are coming from.

Before you downvote me, please know I have spent hours upon hours working on this and reading trying to figure this out. I appreciate anyone who can help me out here.

Alan Moore
Bob
  • 1
    *very* long answer about handling Unicode in Perl here: http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default – roeland Nov 20 '15 at 03:52
  • 1
    But in short: “their true and pure form” doesn't really mean anything. It has to be stored using some encoding like UTF-8 or UTF-16. If your text looks like `'áâã'` then you're probably sending UTF-8 but declaring it as ISO-8859-1 in the HTTP header. – roeland Nov 20 '15 at 03:56
  • Exactly how it looks. Kudos for the link to the article about Unicode in Perl! My head is swimming now! – Bob Nov 20 '15 at 04:04
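
The mismatch roeland describes can be reproduced in a few lines. This is a minimal sketch using only the core `Encode` module; the variable names are made up for illustration. It encodes the copyright sign as UTF-8 and then decodes those bytes as ISO-8859-1, which is roughly what a browser does when the declared charset is wrong.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A9 COPYRIGHT SIGN as a one-character Perl string.
my $copyright = "\x{00A9}";

# Encoded as UTF-8 it becomes two bytes: 0xC2 0xA9.
my $utf8_bytes = encode('UTF-8', $copyright);

# Reading those bytes as ISO-8859-1 turns each byte into its own character:
# 0xC2 is "Â" (U+00C2) and 0xA9 is "©" (U+00A9).
my $misread = decode('ISO-8859-1', $utf8_bytes);

printf "bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
printf "misread as: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
```

The extra "Â" in front of the real character is the accented "A" from the question: it is the first byte of the UTF-8 pair shown on its own.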

1 Answer

I changed the HTTP header to -charset=>'utf-8', and now it works perfectly.
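
For anyone landing here later, this is a rough sketch of that change. It assumes the script uses CGI.pm, since `-charset=>` is the argument style of its `header` method. The body string is a hypothetical stand-in for text read unchanged from the saved file.

```perl
use strict;
use warnings;
use CGI;

my $q = CGI->new;

# Build the HTTP header with an explicit UTF-8 charset, so the browser
# decodes the UTF-8 bytes in the body instead of assuming ISO-8859-1.
my $header = $q->header(-type => 'text/html', -charset => 'utf-8');

# Hypothetical body: the UTF-8 bytes for the copyright sign (0xC2 0xA9),
# standing in for text read as-is from the saved file.
my $body = "<p>\xC2\xA9 2015</p>\n";

print $header, $body;
```

With the charset declared, the two bytes are shown as a single copyright sign rather than as two separate characters.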

Bob