When I try to assign a copyright symbol to a variable in Perl, it gets converted to another symbol
I need
$a=©;
But I get this instead
$a =Â©
Please suggest a solution
OK - you need to know something about character encodings.
There is something called a character set - it is a named group of valid characters ("A", "z", "1", "£" etc). A perl string generally holds characters. Perl's character set includes everything in the world (and then more).
Now, each character in the set is given a number (code-point) so we know what we are talking about (65="A" in many sets, but not necessarily all of them). Traditionally, different countries / computer companies came up with their own codes for some characters (in the UK, "£" was considered important to have, less so in the USA). So - we need to know what character-set we want to use when exchanging information.
However, when we write to a file or send a message over a network then we write bytes, which can only hold numbers 0-255. So - what do we do with characters whose code-points are greater than 255?
We need an encoding. This is a set of rules that say how to turn our code-points into bytes.
Unicode is a character set containing pretty much every written symbol ever used (they keep adding to it too). It has a number of encodings, perhaps the most common of which is UTF-8. The UTF-8 encoding uses one byte for code-points up to 127 and two to four bytes for anything larger (google if you care why).
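To make the byte counts concrete, here is a minimal sketch using the core Encode module. The helper name utf8_length is made up for this illustration, and the four code-points are just sample picks ("A", the copyright sign, the euro sign and an emoji).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many bytes does UTF-8 need for one code-point?
sub utf8_length {
    my ($codepoint) = @_;
    # encode() returns a byte string, so length() counts bytes here
    return length(encode('UTF-8', chr($codepoint)));
}

# "A", copyright sign, euro sign, grinning-face emoji
for my $cp (0x41, 0xA9, 0x20AC, 0x1F600) {
    printf "U+%04X takes %d byte(s) in UTF-8\n", $cp, utf8_length($cp);
}
```

The copyright sign is code-point 169, which is above 127, so it already needs two bytes.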
ISO-8859-1 is a European-based character-set and encoding (one byte per character). It was revised in ISO-8859-15 which among other things introduced the Euro "€" symbol. Both hold only a tiny fraction of the characters in the Unicode standard (no Arabic, Chinese, smiley faces etc).
There is no way to tell a file in ISO-8859-1 from one in ISO-8859-15 without understanding what it is saying. In one, the byte 0xA4 means "¤"; in the other, "€".
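A short sketch of that ambiguity, assuming the core Encode module: the same single byte is decoded under both encodings and the resulting code-point is printed. The helper name codepoint_of_byte is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one byte under $enc and return its code-point.
sub codepoint_of_byte {
    my ($enc, $byte) = @_;
    my $char = decode($enc, $byte);
    return ord($char);
}

my $byte = "\xA4";
for my $enc ('ISO-8859-1', 'ISO-8859-15') {
    # Print code-points rather than the characters themselves, so the
    # result does not depend on what the terminal can display.
    printf "%-11s : byte 0xA4 is U+%04X\n", $enc, codepoint_of_byte($enc, $byte);
}
```

U+00A4 is the generic currency sign and U+20AC is the euro sign. The byte is identical; only the label on the file changes what it means.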
It is sometimes possible to spot a UTF-8 file, since UTF-8 has strict rules for how the bytes of code-points above 127 are laid out, and text in other encodings usually breaks them.
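One way to do that spotting, as a rough sketch: ask Encode to decode the bytes as strict UTF-8 and see whether it croaks. The name looks_like_utf8 is made up here. Note that pure ASCII always passes, and a pass is only a hint that the data is UTF-8, not proof.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub looks_like_utf8 {
    my ($octets) = @_;
    # decode() may modify its input when a CHECK value is given,
    # so hand it a copy.
    my $copy = $octets;
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

# "test" + copyright sign, first as UTF-8 bytes, then as ISO-8859-1 bytes
for my $sample ("test\xC2\xA9", "test\xA9") {
    my @hex = map { sprintf '%02X', ord } split //, $sample;
    printf "%-20s : %s\n", "@hex",
        looks_like_utf8($sample) ? 'valid UTF-8' : 'not UTF-8';
}
```

A lone 0xA9 byte is a continuation byte with nothing before it, which UTF-8 never produces, so the second sample is rejected.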
In your case, those two characters for the copyright symbol? They're a UTF-8 encoding of that character. You presumably typed it with ISO-8859-something or Windows-something.
Below is a small script to illustrate what I mean. It prints out "test©" in two encodings showing the bytes (octets) used for both. Your terminal will only display one successfully.
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# chr(169) is the copyright sign (U+00A9)
print_charcodes('UTF-8', 'test'.chr(169));
print_charcodes('ISO-8859-1', 'test'.chr(169));
exit;

# Encode $chars as $enc, then print each byte's value followed by the raw bytes
sub print_charcodes {
    my ($enc, $chars) = @_;
    my $octets = encode($enc, $chars, Encode::FB_CROAK);
    my @codes = map { ord $_ } split('', $octets);
    print sprintf('%11s : ',$enc), join(" ", @codes), " : $octets", "\n";
}
Phew - that's the absolute minimum you need to know to cope with characters in the 21st century. There's a huge amount of detail when you start trying to process this stuff (what's a number? what's punctuation, how do I lower-case?). Read this post for the gory details. Oh - and when you do, remember Perl is supposed to be better at this than most languages.
P.S. - Unicode experts. Yes I realise this is over-simplifying a lot of fiddly detail, but I wanted to convey the basics without getting quite as scary as the linked post.
The question is how you got what you got:
In UTF-8, the © is represented as the two-byte sequence C2 A9.
In Windows code page 1252, which is the default code page in the U.S., the bytes C2 A9 represent two characters: Â and ©.
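The mix-up can be reproduced in a few lines. This is only a sketch: it assumes the misreading side uses cp1252 (a guess about the asker's setup), and misread_utf8_as is a name invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: write $text out as UTF-8 bytes, then read those
# bytes back as if they were in $wrong_enc.
sub misread_utf8_as {
    my ($wrong_enc, $text) = @_;
    my $octets = encode('UTF-8', $text);
    return decode($wrong_enc, $octets);
}

# One copyright sign goes in ...
my $garbled = misread_utf8_as('cp1252', chr(0xA9));

# ... and two characters come out: U+00C2 (A with circumflex) and U+00A9.
printf "U+%04X ", ord($_) for split //, $garbled;
print "\n";
```

So nothing is "converting" the symbol. The bytes are right; they are just being interpreted under the wrong encoding.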
You didn't say how or what you did. Did you type your variable in as $a = "©", but it displays instead as $a = "Â©"? Or did you type your Perl script in one place, but it shows up with the wrong value elsewhere? Or, if you're running a Perl script, did you type in © as input but get Â© as output?
I'm not going to repeat Richard Huxton's explanation, but you do need to understand how characters are represented.
What's the context? If you're printing this to the console, the other commenters are right: you need to use the proper encoding, and then $a="©"; should do just fine. If you're writing to a web page, it's probably wiser to use $a="&copy;" so the browser will interpret it correctly.
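Both options might look something like this. It is a sketch under two assumptions: the terminal expects UTF-8, and the helper html_escape_non_ascii (a made-up name, not a library function) is acceptable for the web case. It emits numeric character references such as &#169;, which browsers treat the same as the named entity for the copyright sign.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with a numeric
# HTML character reference, so the page's own encoding stops mattering.
sub html_escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# Written as an escape so the script does not depend on how this
# source file itself is saved.
my $copyright = "\x{A9}";

# Console: tell Perl which encoding the terminal expects (UTF-8 assumed).
binmode(STDOUT, ':encoding(UTF-8)');
print "console: $copyright\n";

# Web page: pure-ASCII output.
print "html   : ", html_escape_non_ascii("$copyright Example Corp"), "\n";
```

If you do type a literal © into the script, save the file as UTF-8 and add "use utf8;" at the top so Perl reads the source correctly.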