I'd like to match CJK characters, but the regex [[:alpha:]]\+ does not match them. Does anybody know how to match CJK characters?

$ echo '程 a b' | sed -e 's/\([[:alpha:]]\+\)/x\1/g'
程 xa xb

The desired output is x程 a b.

user1424739

2 Answers

As @WiktorStribiżew suggests, it will be easier to use Perl.
If Perl is an option for you, try the following:

echo "程 a b" | perl -CIO -pe 's/(\p{Script_Extensions=Han}+)/x$1/g'

Output:

x程 a b
daxim
tshiono

With Perl, your solution will look like

perl -CSD -Mutf8 -pe 's/\p{Han}+/x$&/g' filename

Or, with Perl versions older than 5.20, use a capturing group instead of $&:

perl -CSD -Mutf8 -pe 's/(\p{Han}+)/x$1/g' filename

To modify the file in place, add the -i option:

perl -i -CSD -Mutf8 -pe 's/(\p{Han}+)/x$1/g' filename

NOTES

  • \p{Han} matches a single Han (Chinese) character; \p{Han}+ matches a run of one or more Han characters
  • $1 is the backreference to the value captured with (\p{Han}+); $& holds the whole match
  • -Mutf8 lets Perl recognize UTF-8 encoded characters used directly in your Perl code
  • -CSD (equivalent to -CIOED) decodes input as UTF-8 and re-encodes output as UTF-8: S covers STDIN, STDOUT, and STDERR, and D covers files opened for reading or writing.
Wiktor Stribiżew