
Inserting a "," in a particular position of a text

Following the question above, I got errors because the text contained some full-width characters. I work with Japanese text data on a RHEL server. The answer to that question was a perfect solution for utf-8 text, but the UNIX command won't work for Japanese text in SJIS format.

The difference between these two is that utf-8 counts every character as 1 byte, whereas SJIS counts letters and digits as 1 byte each and other Japanese characters, such as あ, as 2 bytes each. So the sed command only works for utf-8 when inserting ',' at given positions.

My input looks like this:

aaaああ123あ

And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes so my desired outcome is

aaa,ああ,123,あ

It does not have to be a sed command, as long as it works on a UNIX system. Is there any way to insert ',' after a given number of bytes while counting each full-width character as 2 bytes and every other character as 1 byte?

gggert
  • What is your `sed` version? Please post the output of `sed --version`. I think GNU `sed` can handle Unicode – Inian Sep 07 '20 at 08:29
  • You can convert your input from SJIS to UTF-8 using `recode` or `iconv`, use GNU `sed` to manipulate the text data, and convert it back to SJIS. See https://superuser.com/q/313032 – Bodo Sep 07 '20 at 08:34
  • @ggert : _insert ',' after 3 bytes, 4 bytes and 3 bytes_ : This does not make any sense when talking about a character encoding where a character can occupy more than one byte. Do you perhaps mean "after 3 characters", etc.? Please clarify this issue in your question. – user1934428 Sep 07 '20 at 10:24
  • I think there's a fundamental problem in this statement: "utf-8 counts every character as 1 byte". That's true for the first 128 Unicode code points, so you might get away with it. But it's really not something you should rely on. The best way forward here, as others have said, is to convert the data into a format that `sed` can handle on a character-by-character basis. – Kevin Boone Sep 07 '20 at 13:04

1 Answer


あ is 3 bytes in UTF-8.
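The byte counts are easy to check from the shell. The following is only a small sketch: `bytes_in` is a hypothetical helper name, and it assumes that the script and terminal are UTF-8 and that your `iconv` accepts the encoding name `SHIFT-JIS` (glibc's does).

```shell
# bytes_in ENCODING STRING
# Hypothetical helper: print how many bytes STRING occupies in ENCODING.
# Assumes STRING is typed as UTF-8.
bytes_in() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | wc -c | tr -d ' '
}

bytes_in UTF-8 'あ'        # prints 3
bytes_in SHIFT-JIS 'あ'    # prints 2
bytes_in SHIFT-JIS 'a'     # prints 1
```

So the 3, 4 and 3 byte widths in the question only line up once the data is in SJIS.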

Whether GNU sed matches multibyte characters or single bytes depends on the locale. So reset the locale to C before running the sed command, and `.` will match one byte at a time.

And I would like to insert ',' after 3 bytes, 4 bytes and 3 bytes

Just use capture groups and backreferences to remember the bytes.

LC_ALL=C sed 's/^\(...\)\(....\)\(...\)/\1,\2,\3,/'

Or you could specify the repetition counts explicitly:

LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/'

Or, more cleanly, with extended regular expressions (`-E`):

LC_ALL=C sed -E 's/^(.{3})(.{4})(.{3})/\1,\2,\3,/'

The following seems to work in my terminal:

$ <<<'aaaああ123あ' iconv -f UTF-8 -t SHIFT-JIS | LC_ALL=C sed 's/^\(.\{3\}\)\(.\{4\}\)\(.\{3\}\)/\1,\2,\3,/' | iconv -f SHIFT-JIS -t UTF-8
aaa,ああ,123,あ
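If the widths vary from file to file, the sed expression can be generated from a list of byte widths instead of being written by hand. This is only a sketch: `insert_commas` is a hypothetical helper name, it assumes its input is already SJIS (or any encoding where the widths are meant as bytes), and it is limited to nine fields because sed only has the backreferences `\1` to `\9`.

```shell
# insert_commas WIDTH...
# Hypothetical helper: read lines on stdin and insert ',' after each
# run of WIDTH bytes, counted from the start of the line.
insert_commas() {
    if [ "$#" -gt 9 ]; then
        echo "insert_commas: at most 9 widths (sed has only \\1..\\9)" >&2
        return 1
    fi
    pattern='^'
    replacement=''
    i=1
    for width in "$@"; do
        # One capture group of exactly $width bytes: \(.\{N\}\)
        pattern="${pattern}\\(.\\{${width}\\}\\)"
        # Emit that group followed by a comma: \i,
        replacement="${replacement}\\${i},"
        i=$((i + 1))
    done
    # LC_ALL=C makes '.' match a single byte rather than a character.
    LC_ALL=C sed "s/${pattern}/${replacement}/"
}

# Same round trip as above: convert to SJIS, split by bytes, convert back.
printf 'aaaああ123あ\n' |
    iconv -f UTF-8 -t SHIFT-JIS |
    insert_commas 3 4 3 |
    iconv -f SHIFT-JIS -t UTF-8
# prints aaa,ああ,123,あ
```

For widths 3 4 3 the generated command is identical to the second sed command above. Lines shorter than the total width do not match the pattern and pass through unchanged.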
KamilCuk