Encoding question in Perl

Question

I have an encoding question and would like to ask for help. I notice that if I choose "UTF-8" as the encoding, there are (at least) two kinds of double quotes, " and “. But when I choose "ISO-8859-1" as the encoding, the latter double quote becomes ¡° or sometimes, for example, â€œ.

Could anyone please explain why this is the case? How can I match “ and replace it with " using a regexp in Perl?

Thanks a lot.

Define “to choose UTF-8 as encoding”. Do you mean `use utf8` for source code, or `use open qw(:std :utf8)` for streams, or something else altogether? — tchrist, Jun 11 '11 at 00:21
See also [this answer](http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129). — tchrist, Jun 11 '11 at 00:26

Nemo · Accepted Answer · 2011-06-11T00:19:40.447

3

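ISO-8859-1 is a one-byte-per-character encoding. The fancy Unicode double quotes are not in the ISO-8859-1 character set. So what you are seeing is a multi-byte character displayed one byte at a time, with each byte shown as a separate single-byte character. In UTF-8, for instance, “ is the three bytes E2 80 9C, which is where the three-character â€œ comes from.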
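
To make the byte-level picture concrete, here is a small sketch that reproduces both garbled forms from the question. It assumes the â€œ form comes from UTF-8 bytes read as Windows-1252 (which many programs use when they say ISO-8859-1) and that the ¡° form comes from a GB2312-family encoding read as ISO-8859-1; the question does not say which tools were involved, so treat both as plausible explanations rather than a diagnosis.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+201C LEFT DOUBLE QUOTATION MARK as a Perl character string.
my $quote = "\x{201C}";

# Encode to UTF-8 and show the resulting bytes in hex.
my $utf8_bytes = encode('UTF-8', $quote);
my $hex        = uc unpack('H*', $utf8_bytes);

# Misread the UTF-8 bytes as Windows-1252: one character per byte.
my $as_cp1252 = decode('cp1252', $utf8_bytes);

# Assumption: the other garbled form starts from GB2312 (EUC-CN) bytes,
# misread as ISO-8859-1.
my $gb_bytes  = encode('euc-cn', $quote);
my $as_latin1 = decode('iso-8859-1', $gb_bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "UTF-8 bytes:            $hex\n";        # E2809C
print "UTF-8 read as cp1252:   $as_cp1252\n";  # â€œ
print "EUC-CN read as Latin-1: $as_latin1\n";  # ¡°
```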

To match these weird things, see the perlunicode man page, especially the \x{...} and \N{...} escape sequences.

To answer your question, try \x{201C} to match the Unicode LEFT DOUBLE QUOTATION MARK and \x{201D} to match the RIGHT DOUBLE QUOTATION MARK. You missed the latter in your question :-).
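
A minimal sketch of the replacement, assuming the input arrives as UTF-8 bytes and is decoded into characters first (the sample string and variable names are made up for illustration). The `\x{201C}` escape only matches if the string holds decoded characters, not raw bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: raw UTF-8 bytes containing both curly double quotes.
my $bytes = "He said \xE2\x80\x9Chello\xE2\x80\x9D to me.";

# Decode bytes into a character string so each quote is one character.
my $text = decode('UTF-8', $bytes);

# Replace LEFT and RIGHT DOUBLE QUOTATION MARK with the plain ASCII quote.
(my $plain = $text) =~ s/[\x{201C}\x{201D}]/"/g;

print "$plain\n";   # He said "hello" to me.
```

When reading from a file or STDIN, an encoding layer such as `use open qw(:std :utf8)` does the decoding step for you.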

[update]

I should have provided my reference... Some nice gentleman in the UK has a page on ASCII and Unicode quotation marks. The plain vanilla ASCII/ISO-8859-1 double-quote is just called QUOTATION MARK.

edited Jun 11 '11 at 00:19

answered Jun 11 '11 at 00:12

Nemo


Thank you for your answer. :) So what is the name of the other plain double quotation mark? – Qiang Li Jun 11 '11 at 00:17
@Qiang: Yes. I added an update with the link I should have included in the first place – Nemo Jun 11 '11 at 00:20
Best to `use charnames ":full"` and thence `\N{LEFT DOUBLE QUOTATION MARK}` and the like. I mislike magic numbers in code, and 0x201C is certainly one such. – tchrist Jun 11 '11 at 00:22
@Qiang: You should get [the uninames script](http://training.perl.com/scripts/uninames) if you want to know the names of code points. There is a lot more unicode-related stuff there [in that directory](http://training.perl.com/scripts/), too. – tchrist Jun 11 '11 at 00:24
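
The named-character approach suggested in the comments could look roughly like this. It is a sketch, not tchrist's own code; the sample string is invented, and `charnames::viacode` is used here as a built-in way to look up a code point's name.

```perl
use strict;
use warnings;
use charnames ':full';

# Sample text wrapped in curly double quotes, written with character names.
my $text = "\N{LEFT DOUBLE QUOTATION MARK}quoted\N{RIGHT DOUBLE QUOTATION MARK}";

# Same substitution as with \x{...}, but with self-describing names.
(my $plain = $text)
    =~ s/[\N{LEFT DOUBLE QUOTATION MARK}\N{RIGHT DOUBLE QUOTATION MARK}]/\N{QUOTATION MARK}/g;

# Look up the official names of the code points involved.
my $ascii_name = charnames::viacode(0x22);
my $left_name  = charnames::viacode(0x201C);

print "$plain\n";                 # "quoted"
print "U+0022: $ascii_name\n";    # QUOTATION MARK
print "U+201C: $left_name\n";     # LEFT DOUBLE QUOTATION MARK
```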

score -1 · Answer 2 · edited May 23 '17 at 10:29


Maybe this old post will help.


answered Jun 14 '11 at 09:46

ppant

