
I am trying to replace various characters with either a single quote or double quote.

Here is my test file:

# Replace all with double quotes
＂ fullwidth
“ left
” right
„ low
" normal

# Replace all with single quotes
' normal
‘ left
’ right
‚ low
‛ reverse
` backtick

I'm trying to do this...

perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt
perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' test.txt

But only the backtick character gets replaced properly. I think it has something to do with the other code points being too large, but I cannot find any documentation on this.

Here I have a one-liner which dumps the Unicode code points, to verify they match my regular expression.

$ awk -F\  '{print $1}' test.txt | \
    perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord)." ".$_."\n"}'

U+FF02 ＂
U+201C “
U+201D ”
U+201E „
U+0022 "

U+0027 '
U+2018 ‘
U+2019 ’
U+201A ‚
U+201B ‛
U+0060 `

Why isn't my regular expression matching?

Peter Mortensen
David Chan

2 Answers


It isn’t matching because you forgot the -CSAD in your call to Perl, and don’t have $PERL_UNICODE set in your environment. You have only said -Mutf8 to announce that your source code is in that encoding. This does not affect your I/O.

You need:

$ perl -CSAD -pi.orig -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/g" test.txt

I do mention this sort of thing in this answer a couple of times.

tchrist

With use utf8;, you told Perl your source code is UTF-8. This is useless (though harmless) since you've limited your source code to ASCII.

With /u, you told Perl to use the Unicode definitions of \s, \d, \w. This is useless (though harmless) since you don't use any of those patterns.

You did not decode your input, so your input consists solely of bytes, and most of the characters in your class (e.g. \x{2018}) can't possibly match anything. You need to decode your input (and of course, encode your output). Using -CSD will likely do this.

perl -CSD -i -pe'
   s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/\x27/g;
   s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
' test.txt
ikegami