Special characters with extra characters showing up before them

Question

I have been busting my noggin trying to figure out how to handle some special common characters that are input by users in forms. Examples of what I mean are the copyright sign, registered sign, fraction 1/2, fraction 1/4, etc. So here is what happens:

Users enter these characters, and they are saved into a regular text file. No problem. They are saved in their true and pure form. Now when we grab them with the Perl CGI file, and they are displayed in a browser, I get all these "A"s and other A-characters with markings above them. I am running a subroutine on the string to try to convert these from Unicode matches into HTML entities, but it doesn't seem to be working.

Perl Code:

#string with special characters
$special_chars=encodebc($special_chars);

sub encodebc{
my $answer = $_[0];
$answer =~ s/:://gi;
# Escape & first so the entities produced below are not double-escaped.
$answer =~ s/\x{0026}/&amp;/g;
$answer =~ s/\x{0022}/&quot;/g;
$answer =~ s/\x{0027}/&#039;/g;
$answer =~ s/\x{003C}/&lt;/g;
$answer =~ s/\x{003E}/&gt;/g;
$answer =~ s/\x{0060}/&#096;/g;
$answer =~ s/\x{007B}/&#123;/g;
$answer =~ s/\x{007D}/&#125;/g;
$answer =~ s/\x{00A9}/&copy;/g;
$answer =~ s/\x{00AE}/&reg;/g;
$answer =~ s/\x{00AB}/&laquo;/g;
$answer =~ s/\x{00BB}/&raquo;/g;
$answer =~ s/\x{00A2}/&cent;/g;
$answer =~ s/\x{00B0}/&deg;/g;
$answer =~ s/\x{00B2}/&sup2;/g;
$answer =~ s/\x{00B3}/&sup3;/g;
$answer =~ s/\x{00B5}/&micro;/g;
$answer =~ s/\x{00BC}/&frac14;/g;
$answer =~ s/\x{00BD}/&frac12;/g;
$answer =~ s/\x{00BE}/&frac34;/g;
$answer =~ s/\x{00E1}/&aacute;/g;
$answer =~ s/\x{00E9}/&eacute;/g;
$answer =~ s/\x{00F1}/&ntilde;/g;
$answer =~ s/\x{00F5}/&otilde;/g;
$answer =~ s/\x{00F8}/&oslash;/g;
return $answer;
}

In the above code, I'm matching two-byte Unicode characters, so I don't understand where the "A" characters are coming from.

Before you downvote me, please know I have spent hours upon hours working on this and reading trying to figure this out. I appreciate anyone who can help me out here.

*very* long answer about handling Unicode in Perl here: http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default — roeland, Nov 20 '15 at 03:52
But in short: “their true and pure form” doesn't really mean anything. It has to be stored using some encoding like UTF-8 or UTF-16. If your text looks like `'Ã¡Ã¢Ã£'` then you're probably sending UTF-8 but declaring it as ISO-8859-1 in the HTTP header. — roeland, Nov 20 '15 at 03:56
Exactly how it looks. Kudos for the link to the article about Unicode in Perl! My head is swimming now! — Bob, Nov 20 '15 at 04:04
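
To illustrate the diagnosis in the comments, here is a minimal sketch of why the `\x{00E1}`-style patterns never matched and where the extra "A" characters come from. It assumes the text file holds UTF-8 bytes and that the script never decodes them; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+00E1 (a with acute) is the two bytes C3 A1.
my $bytes = "\xC3\xA1";

# Undecoded, the string holds two "characters" (0xC3 and 0xA1), so a
# pattern looking for the single code point U+00E1 finds nothing.
my $raw_matches = ($bytes =~ /\x{00E1}/) ? 1 : 0;

# After decoding, the string holds one character and the pattern matches.
my $chars = decode('UTF-8', $bytes);
my $decoded_matches = ($chars =~ /\x{00E1}/) ? 1 : 0;

# Reading the same bytes as ISO-8859-1 yields the two characters
# U+00C3 and U+00A1: the "A with a mark" garbage seen in the browser.
my $misread = decode('ISO-8859-1', $bytes);

printf "raw match: %d, decoded match: %d, misread length: %d\n",
    $raw_matches, $decoded_matches, length $misread;
# prints "raw match: 0, decoded match: 1, misread length: 2"
```

So the substitutions for anything above U+007F in encodebc were most likely silently skipped, and the browser was left to interpret the raw bytes on its own.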

score 0 · Accepted Answer · answered Nov 20 '15 at 04:14

I changed the HTTP header to -charset=>'utf-8', and now it works perfectly.
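
For anyone landing here later, a rough sketch of what that change can look like with CGI.pm. This assumes CGI.pm is installed (it no longer ships with recent Perl releases) and that the saved text is UTF-8; `$saved_bytes` is a stand-in for whatever is read from the file.

```perl
use strict;
use warnings;
use CGI;

my $q = CGI->new;

# Declare the body as UTF-8 so the browser does not assume ISO-8859-1.
my $header = $q->header(-type => 'text/html', -charset => 'utf-8');
print $header;

# Assumption: the text file was written as UTF-8. These are the raw
# bytes of the copyright sign (U+00A9); they pass through unchanged.
my $saved_bytes = "\xC2\xA9 2015";
print "<p>$saved_bytes</p>\n";
```

With the charset declared, the bytes can go out as they were stored and the browser decodes them correctly, which is why no entity conversion is needed for these characters.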

Bob