
I have an encoding question and would like to ask for help. When I choose "UTF-8" as the encoding, I see (at least) two kinds of double quote: " and “. But when I choose "ISO-8859-1" as the encoding, the latter double quote becomes ¡°, or sometimes â€œ.

Could anyone please explain why this is the case? How can I match it and replace it with " using a regexp in Perl?

Thanks a lot.

dan04
Qiang Li
  • Define “to choose UTF-8 as encoding”. Do you mean `use utf8` for source code, or `use open qw(:std :utf8)` for streams, or something else altogether? – tchrist Jun 11 '11 at 00:21
  • See also [this answer](http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129). – tchrist Jun 11 '11 at 00:26

2 Answers


ISO-8859-1 is a one-byte-per-character encoding, and the fancy Unicode double quotes are not in the ISO-8859-1 character set. So what you are seeing is the byte sequence of a multi-byte character, with each byte displayed as a separate ISO-8859-1 character.
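To make that concrete, here is a minimal sketch using the core `Encode` module. The `code_points` helper is a made-up name for illustration. The choice of source encodings is my guess at what produced the two garbled forms in the question: UTF-8 bytes read as Windows-1252 (which many programs substitute for ISO-8859-1) would give â€œ, and EUC-CN (GB2312) bytes read as ISO-8859-1 would give ¡°.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: list a string's characters as U+XXXX code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

my $left_quote = "\x{201C}";    # LEFT DOUBLE QUOTATION MARK, one character

# UTF-8 stores the quote as three bytes: E2 80 9C.
my $utf8_bytes = encode('UTF-8', $left_quote);

# Read as Windows-1252, each byte turns into its own character.
my $as_cp1252 = decode('cp1252', $utf8_bytes);
print code_points($as_cp1252), "\n";    # U+00E2 U+20AC U+0153

# EUC-CN (GB2312) stores the same quote as two bytes: A1 B0.
my $gb_bytes = encode('euc-cn', $left_quote);

# Read as ISO-8859-1, those two bytes become two characters.
my $as_latin1 = decode('ISO-8859-1', $gb_bytes);
print code_points($as_latin1), "\n";    # U+00A1 U+00B0
```

In both cases the bytes are unchanged; only the interpretation differs, which is why the same quote can show up as different junk depending on how the text was originally saved.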

To match these weird things, see the perlunicode man page, especially the \x{...} and \N{...} escape sequences.

To answer your question, try \x{201C} to match the Unicode LEFT DOUBLE QUOTATION MARK and \x{201D} to match the RIGHT DOUBLE QUOTATION MARK. You missed the latter in your question :-).
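A rough sketch of the substitution, assuming the input arrives as UTF-8 bytes and is decoded to characters before matching (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{...} on older perls, harmless on newer ones
use Encode qw(decode);

# Sample input as raw UTF-8 bytes, as it might arrive from a file or socket.
my $bytes = "She said \xE2\x80\x9Chello\xE2\x80\x9D.";

# Decode first so each curly quote is one character rather than three bytes.
my $text = decode('UTF-8', $bytes);

# Match by code point and replace with the plain QUOTATION MARK (U+0022) ...
(my $by_hex = $text) =~ s/[\x{201C}\x{201D}]/"/g;

# ... or, equivalently, match by character name.
(my $by_name = $text) =~ s/[\N{LEFT DOUBLE QUOTATION MARK}\N{RIGHT DOUBLE QUOTATION MARK}]/"/g;

print "$by_hex\n";     # She said "hello".
print "$by_name\n";    # She said "hello".
```

If the string is still undecoded bytes when the regexp runs, `\x{201C}` will not match, because the quote is then three separate byte-valued characters rather than one.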

[update]

I should have provided my reference... Some nice gentleman in the UK has a page on ASCII and Unicode quotation marks. The plain vanilla ASCII/ISO-8859-1 double-quote is just called QUOTATION MARK.

Nemo
  • thank you for your answer. :) So what is the name of the other plain double quotation mark? – Qiang Li Jun 11 '11 at 00:17
  • @Qiang: Yes. I added an update with the link I should have included in the first place – Nemo Jun 11 '11 at 00:20
  • Best to `use charnames ":full"` and thence use the likes of `\N{LEFT DOUBLE QUOTATION MARK}`. I mislike magic numbers in code, and 0x201C is certainly one such. – tchrist Jun 11 '11 at 00:22
  • @Qiang: You should get [the uninames script](http://training.perl.com/scripts/uninames) if you want to know the names of code points. There is a lot more unicode-related stuff there [in that directory](http://training.perl.com/scripts/), too. – tchrist Jun 11 '11 at 00:24

Maybe this old post will help.

Community
ppant