When I try to assign a copyright symbol to a variable in Perl, it gets converted to a different symbol.

I need

$a=©;

But I get this instead:

$a =Â©

Please suggest a solution.

Toto
user01

3 Answers

OK - you need to know something about character encodings.

There is something called a character set - it is a named group of valid characters ("A", "z", "1", "£" etc). A perl string generally holds characters. Perl's character set includes everything in the world (and then more).

Now, each character in the set is given a number (code-point) so we know what we are talking about (65="A" in many sets, but not necessarily all of them). Traditionally, different countries / computer companies came up with their own codes for some characters (in the UK, "£" was considered important to have, less so in the USA). So - we need to know what character-set we want to use when exchanging information.

However, when we write to a file or send a message over a network then we write bytes, which can only hold numbers 0-255. So - what do we do with characters whose code-points are greater than 255?

We need an encoding. This is a set of rules that say how to turn our code-points into bytes.

Unicode is a character set containing pretty much every written symbol ever used (they keep adding to it too). It has a number of encodings, perhaps the most common of which is UTF-8. The UTF-8 encoding uses multiple bytes for code-points larger than 127 (google if you care why).
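
To make the byte counts concrete, here is a minimal sketch using the core Encode module. The helper name utf8_hex is made up for this illustration; it just prints the UTF-8 bytes of one character in hex.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 bytes of a string as uppercase hex pairs.
sub utf8_hex {
    my ($char) = @_;
    my $octets = encode('UTF-8', $char);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# "A" (65), the copyright sign (169) and the euro sign (8364).
printf "U+%04X -> %s\n", $_, utf8_hex(chr $_) for 0x41, 0xA9, 0x20AC;
```

"A" stays a single byte, the copyright sign needs two (C2 A9) and the euro sign three (E2 82 AC).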

ISO-8859-1 is a European-based character-set and encoding (one byte per character). It was revised in ISO-8859-15 which among other things introduced the Euro "€" symbol. Both hold only a tiny fraction of the characters in the Unicode standard (no Arabic, Chinese, smiley faces etc).

There is no way to tell a file in ISO-8859-1 from one in ISO-8859-15 without understanding what it is saying. In one, the byte 0xA4 means "¤"; in the other, "€".
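
A quick way to see this for yourself, sketched with the core Encode module (codepoint_in is a hypothetical helper name, not anything standard): decode the same byte under both encodings and look at the code-point you get back.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the code-point a single byte stands for in $enc.
sub codepoint_in {
    my ($enc, $byte) = @_;
    return ord(decode($enc, $byte));
}

for my $enc ('ISO-8859-1', 'ISO-8859-15') {
    printf "%-11s : byte 0xA4 -> U+%04X\n", $enc, codepoint_in($enc, "\xA4");
}
```

The same byte comes back as U+00A4 (the generic currency sign) in one case and U+20AC (the euro sign) in the other.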

It is sometimes possible to spot a UTF-8 file since it has certain rules for how to generate large codepoints.
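
One rough way to do that spotting in Perl is to attempt a strict decode and see whether it fails. This is only a sketch: looks_like_utf8 is a name invented here, and a "yes" is just a hint, since plain ASCII passes too and other encodings can occasionally produce valid-looking sequences.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the bytes are well-formed UTF-8.
sub looks_like_utf8 {
    my ($octets) = @_;
    # With a CHECK value decode() may modify its argument; $octets is our own copy.
    return eval { decode('UTF-8', $octets, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

print looks_like_utf8("test\xC2\xA9") ? "plausibly UTF-8\n" : "not UTF-8\n";
print looks_like_utf8("test\xA9")     ? "plausibly UTF-8\n" : "not UTF-8\n";
```

The first string is "test©" as UTF-8 bytes and passes; the second is the ISO-8859-1 form, where the lone byte 0xA9 is not a legal UTF-8 sequence.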

In your case, those two characters for the copyright symbol? They're the UTF-8 encoding of that character, being read as ISO-8859-something or Windows-something.

Below is a small script to illustrate what I mean. It prints out "test©" in two encodings showing the bytes (octets) used for both. Your terminal will only display one successfully.

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

print_charcodes('UTF-8', 'test'.chr(169));
print_charcodes('ISO-8859-1', 'test'.chr(169));
exit;

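# Encode $chars as $enc and print the decimal value of each resulting byte.
sub print_charcodes {
    my ($enc, $chars) = @_;
    # FB_CROAK makes encode() die if a character cannot be represented in $enc.
    my $octets = encode($enc, $chars, Encode::FB_CROAK);
    my @codes = map { ord $_ } split('', $octets);
    printf "%11s : %s : %s\n", $enc, join(" ", @codes), $octets;
}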
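
The doubled character from the question can be reproduced in the same spirit. This sketch assumes the bytes were misread as Windows-1252 (cp1252), which the question does not actually confirm, and misread_as is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode a string as UTF-8, then decode the bytes
# with the wrong single-byte encoding, as a confused viewer would.
sub misread_as {
    my ($wrong_enc, $chars) = @_;
    my $octets = encode('UTF-8', $chars);
    return decode($wrong_enc, $octets);
}

my $misread = misread_as('cp1252', chr(169));
printf "%d characters: %s\n", length($misread),
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
```

One character goes in and two come out: U+00C2 ("Â") followed by U+00A9 ("©").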

Phew - that's the absolute minimum you need to know to cope with characters in the 21st century. There's a huge amount of detail when you start trying to process this stuff (what's a number? what's punctuation, how do I lower-case?). Read this post for the gory details. Oh - and when you do, remember Perl is supposed to be better at this than most languages.

P.S. - Unicode experts. Yes I realise this is over-simplifying a lot of fiddly detail, but I wanted to convey the basics without getting quite as scary as the linked post.

Richard Huxton

A question is how you got what you got:

In UTF-8, the © is represented as the two-byte sequence C2 A9.

In Windows code page 1252, which is the default code page in the U.S., C2 A9 represents two characters: Â and ©.

You didn't say what you did or how. Did you type your variable as $a = "©" and see it displayed as $a = "Â©"? Did you type your Perl script in one place and find the wrong value showing up elsewhere? Or did you run a Perl script, type © as input, and get Â© as output?

I'm not going to repeat Richard Huxton's explanation, but you do need to understand how characters are represented.

David W.

What's the context? If you're printing this to the console, the other commentators are right: you need to use the proper encoding, and then $a="©"; should do just fine. If you're writing to a web page, it's probably wiser to use $a="&copy;" so the browser will interpret it correctly.
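
A rough sketch of both cases, assuming the script file is saved as UTF-8, the terminal expects UTF-8, and the web-page form meant here is the HTML entity &copy; (none of which the question confirms). The variable is called $copyright rather than $a, since $a is reserved for sort.

```perl
use strict;
use warnings;
use utf8;                             # assumption: this source file is saved as UTF-8
binmode(STDOUT, ':encoding(UTF-8)');  # assumption: the terminal expects UTF-8

my $copyright = "©";                  # one character, code-point 169
print "console: $copyright\n";

# For a web page, swap the character for its HTML entity instead.
(my $html = $copyright) =~ s/\x{A9}/&copy;/g;
print "html: $html\n";
```

Without "use utf8" Perl would treat the literal as two separate bytes, and without the binmode call the output layer would not know which encoding to write.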

Creede