How to match CJK characters with sed?

Question

I'd like to match CJK characters, but the regex [[:alpha:]]\+ does not work. Does anybody know how to match CJK characters?

$ echo '程 a b' | sed -e 's/\([[:alpha:]]\+\)/x\1/g'
程 xa xb

The desired output is x程 a b.

I only want to match CJK characters but not other printable characters. — user1424739, Jun 16 '19 at 23:18
This might help: https://stackoverflow.com/a/23189067/3776858 — Cyrus, Jun 16 '19 at 23:27
From my own experience, it is much easier with Perl. If there is no rigorous requirement to use sed, do it in Perl, it will be much more comprehensible and concise. — Wiktor Stribiżew, Jun 16 '19 at 23:36

score 2 · Answer 1 · edited Jun 17 '19 at 12:23

As @WiktorStribiżew suggests, it is easier to use Perl.
If Perl is an option, please try the following:

echo "程 a b" | perl -CIO -pe 's/(\p{Script_Extensions=Han})/x$1/g'

Output:

x程 a b

edited Jun 17 '19 at 12:23 by daxim
answered Jun 17 '19 at 00:11 by tshiono

score 0 · Answer 2 · answered Jun 26 '19 at 20:57

With Perl, your solution will look like

perl -CSD -Mutf8 -pe 's/\p{Han}+/x$&/g' filename

Or, with Perl versions before 5.20, where using $& slows down every regex match in the program, use a capturing group:

perl -CSD -Mutf8 -pe 's/(\p{Han}+)/x$1/g' filename

To modify the file contents in place, add the -i option:

perl -i -CSD -Mutf8 -pe 's/(\p{Han}+)/x$1/g' filename

NOTES

\p{Han} matches a single Han character (a Chinese character, including Japanese kanji and Korean hanja); \p{Han}+ matches a run of one or more such characters
$1 is the backreference to the text captured by (\p{Han}+); $& holds the entire match
-Mutf8 lets Perl recognize UTF-8-encoded characters used directly in your Perl code
-CSD (equivalent to -CIOEio) decodes input as UTF-8 and re-encodes output as UTF-8; it applies to STDIN, STDOUT, STDERR, and to files the script opens.
