Perl regular expression matching on large Unicode code points

Question

I am trying to replace various characters with either a single quote or double quote.

Here is my test file:

# Replace all with double quotes
＂ fullwidth
“ left
” right
„ low
" normal

# Replace all with single quotes
' normal
‘ left
’ right
‚ low
‛ reverse
` backtick

I'm trying to do this...

perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt
perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' test.txt

But only the backtick character gets replaced properly. I think it has something to do with the other code points being too large, but I cannot find any documentation on this.

Here I have a one-liner which dumps the Unicode code points, to verify they match my regular expression.

$ awk -F\  '{print $1}' test.txt | \
    perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord)." ".$_."\n"}'

U+FF02 ＂
U+201C “
U+201D ”
U+201E „
U+0022 "

U+0027 '
U+2018 ‘
U+2019 ’
U+201A ‚
U+201B ‛
U+0060 `

Why isn't my regular expression matching?

tchrist · Accepted Answer · 2017-06-01T17:06:41.277

23

It isn’t matching because you forgot the -CSD in your call to Perl, and don’t have $PERL_UNICODE set in your environment. You have only said -Mutf8 to announce that your source code is in that encoding. This does not affect your I/O.

You need:

$ perl -CSD -pi.orig -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/g" test.txt

I do mention this sort of thing in this answer a couple of times.

edited Jun 01 '17 at 17:06

answered Oct 01 '12 at 20:49

tchrist


@tchrist, please correct your answer by replacing -CSAD with -CSD. I do not have the editing powers to do so. – Hans Deragon Jun 01 '17 at 17:03
@HansDeragon Done. – tchrist Jun 01 '17 at 17:06

ikegami · Answer 2 · 2012-10-01T21:18:21.790

8

With use utf8;, you told Perl your source code is UTF-8. This is useless (though harmless) since you've limited your source code to ASCII.

With /u, you told Perl to use the Unicode definitions of \s, \d, \w. This is useless (though harmless) since you don't use any of those patterns.

You did not decode your input, so it consists solely of bytes, and most of the characters in your class (e.g. \x{2018}) can't possibly match anything. You need to decode your input (and of course, encode your output). Using -CSD will likely do this.

perl -CSD -i -pe'
   s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/\x27/g;
   s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
' text.txt

edited Oct 01 '12 at 21:18

answered Oct 01 '12 at 20:57

ikegami


I hate having to figure out how to quote things in the shell. I usually just opt for the `\x27` trick instead. – tchrist Oct 01 '12 at 21:05
I just did `'` ⇒ `'\''` without thinking, but yeah, `'` ⇒ `\x27` is a good idea. – ikegami Oct 01 '12 at 21:07
I think you mean "need to decode your **input**", and probably also then "need to encode your output". – tchrist Oct 01 '12 at 21:14
@tchrist, Typo fixed. Addition added. – ikegami Oct 01 '12 at 21:18
