Why doesn't this sed expression remove lines with Korean as expected?

Question

I combined these two answers to produce this sed command:

sed '/[\u3131-\uD79D]/d' text.txt  # Remove all lines with Korean characters

However, it outputs only the lines with Korean characters:

$ cat text.txt
1
00:00:00,000 --> 00:00:05,410
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
Hello, today we're going to explain how to use the

$ sed '/[\u3131-\uD79D]/d' text.txt  # Korean characters pattern fails
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게

$ sed '/Hello/d' text.txt           # Simple pattern works
1
00:00:00,000 --> 00:00:05,410
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게

$ sed '/[0-9]/d' text.txt           # Simple range works
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
Hello, today we're going to explain how to use the

$ sed --version                     # Git Bash for Windows 2.33.0.windows.2
sed (GNU sed) 4.8

Is this a bug with sed? I was able to use the equivalent command in gVim successfully:

:g/[\u3131-\uD79D]/d

Luis Guzman · Accepted Answer · 2021-09-09T23:21:07.357

It has to do with the collation order of the expression in the bracket, because sed follows POSIX. You need a collation order that sorts by numeric Unicode code point, such as C.UTF-8, and you then need to encode the range endpoints in UTF-8. There is an explanation of the details here.
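
A quick check of why the original command kept only the Korean line: as far as I know, GNU sed regexes have no `\uXXXX` escape (the documented numeric escapes are `\dNNN`, `\oNNN`, and `\xHH`), so `[\u3131-\uD79D]` is probably read as a set of plain ASCII characters containing a range that starts at `1`. The sketch below uses made-up ASCII sample lines and forces `LC_ALL=C` (an assumption, so that the range is compared by byte value); the `xyz` line is expected to be the only one left.

```shell
# Hypothetical sample input: three lines that resemble the subtitle file,
# plus one line made only of lowercase letters that come after 'u' in ASCII.
sample=$(printf '%s\n' '1' '00:00:00,000 --> 00:00:05,410' 'Hello, today' 'xyz')

# Same bracket expression as in the question. LC_ALL=C is an assumption
# made here so that ranges are compared by byte value.
kept=$(printf '%s\n' "$sample" | LC_ALL=C sed '/[\u3131-\uD79D]/d')

printf '%s\n' "$kept"
```

Lines containing digits or uppercase letters are deleted even though none of them contains Korean, which matches the behavior reported in the question.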

This is how to apply it to your range in a bash shell (I tested it on Linux):

$ # first get octal representation of range unicode code points
$ # iconv is to convert to utf-8 in case your locale is not utf-8
$ printf "\u3131\uD79D" | iconv -t utf-8 | od -An -to1
 343 204 261 355 236 235

$ # format it as a sed range
$ printf '\o%s\o%s\o%s-\o%s\o%s\o%s' $(printf "\u3131\uD79D" | iconv -t utf-8 | od -An -to1); echo
\o343\o204\o261-\o355\o236\o235

$ # use the range in sed
$ LC_ALL=C.UTF-8 sed '/[\o343\o204\o261-\o355\o236\o235]/d' text.txt
...
$

Here is the output:

$ LC_ALL=C.UTF-8 sed '/[\o343\o204\o261-\o355\o236\o235]/d' text.txt
1
00:00:00,000 --> 00:00:05,410
Hello, today we're going to explain how to use the

$ sed '/[\u3131-\uD79D]/d' text.txt  # Korean characters pattern fails

$ sed '/Hello/d' text.txt           # Simple pattern works
1
00:00:00,000 --> 00:00:05,410

$ sed '/[0-9]/d' text.txt           # Simple range works
Hello, today we're going to explain how to use the

$
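
The octal values in the range can be cross-checked by hand. For code points from U+0800 to U+FFFF, UTF-8 uses three bytes laid out as `1110xxxx 10xxxxxx 10xxxxxx`. The sketch below applies that layout with shell arithmetic; `utf8_octal3` is a hypothetical helper name, and it assumes the code point is in that three-byte range (it does no validation).

```shell
# utf8_octal3 (hypothetical helper): print the three UTF-8 bytes of a code
# point in U+0800..U+FFFF as sed-style \oNNN escapes. No range checking.
utf8_octal3() {
    cp=$(( $1 ))
    b1=$(( 0xE0 | (cp >> 12) ))          # 1110xxxx: top 4 bits
    b2=$(( 0x80 | ((cp >> 6) & 0x3F) ))  # 10xxxxxx: middle 6 bits
    b3=$(( 0x80 | (cp & 0x3F) ))         # 10xxxxxx: low 6 bits
    printf '\\o%o\\o%o\\o%o' "$b1" "$b2" "$b3"
}

# Build the same range as the od pipeline, from the two code points.
range="$(utf8_octal3 0x3131)-$(utf8_octal3 0xD79D)"
printf '%s\n' "$range"
```

This should print `\o343\o204\o261-\o355\o236\o235`, the same range produced by the `od` pipeline.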

EDIT: helper script/functions

This bash script or its functions can be used to obtain a sed Unicode range:

#!/bin/bash

# sur - sed unicode range
#
#     Converts a unicode range into an octal utf-8 range suitable for sed
#
# Usage:
#        sur \\u452 \\u490
#
#        sur \\u3131 \\uD79D

to_octal() {
    printf "$1" | iconv -t utf-8 | od -An -to1 | sed 's/ \([0-9][0-9]*\)/\\o\1/g'
}

sur () {
    echo "$(to_octal $1)-$(to_octal $2)"
}

sur $1 $2

To use the script, make sure it is executable and in your PATH. Here is an example of how to use the functions. I just copied and pasted them into a bash shell:

$ to_octal() {
>     printf "$1" | iconv -t utf-8 | od -An -to1 | sed 's/ \([0-9][0-9]*\)/\\o\1/g'
> }
$
$ sur () {
>     echo "$(to_octal $1)-$(to_octal $2)"
> }
$
$ sur \\u3131 \\uD79D
\o343\o204\o261-\o355\o236\o235
$ sur \\u452 \\u490
\o321\o222-\o322\o220
$
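
The range printed by `sur` can be dropped into the sed script through a variable. The sketch below hard-codes the range from the output above rather than calling `sur`, and writes a small sample to a temporary file; the file contents and the variable names are only for illustration. It assumes GNU sed (for the `\o` escapes) and that the `C.UTF-8` locale is available.

```shell
# Range as printed by: sur \\u3131 \\uD79D
range='\o343\o204\o261-\o355\o236\o235'

# Illustrative sample file modelled on the subtitle snippet in the question.
tmp=$(mktemp)
printf '%s\n' '1' '00:00:00,000 --> 00:00:05,410' \
    '안녕하세요 오늘은' 'Hello, today' > "$tmp"

# Double quotes let the shell expand $range; sed then sees the \o escapes.
filtered=$(LC_ALL=C.UTF-8 sed "/[$range]/d" "$tmp")
rm -f "$tmp"

printf '%s\n' "$filtered"
```

Single quotes around the sed script would pass the literal text `$range` to sed, so the double quotes matter here.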
