How to use sed expression for substituting double width characters with single width

Question

I want to replace certain double width characters found in a file with their single width equivalents using sed expression. This is not quite working as expected but is expressive of what I want to do (this is in a bash script): I have mixed in alphanumeric ranges with some others I can think of off hand, not sure if this needs to be separated into two different -e arguments based on if ranges, etc.

sed -e 's,[０-９ａ-ｚＡ-Ｚ（）【】－一],[0-9a-zA-Z\[\]\-\-],g' ${file} > ${file}.cleaned

The files are tsv (tab separated values) text files. According to the file command the type is: UTF-8 Unicode text, with CRLF line terminators or (in another case) UTF-8 Unicode text, with no line terminators

Sample input:

Part Number
123-９56-AＡ
343-213-【E】
XTE-898一（5）

Sample output:

Part Number
123-956-AA
343-213-[E]
XTE-898-(5)

My system is Ubuntu16.04 running in a Docker container built from our base image which is built from phusion/passenger-ruby23:0.9.19 which has a base image (eventually to the base) of ubuntu:16.04, shell is GNU bash, version 4.3.46(1)-release (x86_64-pc-linux-gnu), sed version is sed (GNU sed) 4.2.2 and results from locale command is:

LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

Update:

The chosen solution/answer was 1) to use the y command (as the other answers also suggested) and in my case, 2) to set the LL_ALL as shown below to avoid the error I was getting with the y command. It does appear that the range doesn't work for the y command so all characters must be identified individually (as I previously mistakenly thought)

LC_ALL=en_US.UTF-8 sed 'y/ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ１２３４５６７８９０（）【】－一/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()[]--' file.tsv

Update 2:

Per the suggestion from the other answerers (one has mysteriously vanished), the locale being set for the system was further investigated as a solution instead of setting the environment variable at the command line. Since this is a Docker image container environment, I've found a solution to put into our base image which solves the problem at the base system level.

I've added to our base Dockerfile:

# Set the locale
RUN locale-gen en_US.UTF-8
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

and now the locale command produces;

LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

and now the sed command works as follows:

sed 'y/ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ１２３４５６７８９０（）【】－一/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()[]--' file.tsv

As a side note, I wish stackoverflow provided a way to give answer credit to multiple answers since the original 3 answers (again, one vanished) all contributed me getting to the solution but I had to choose only one. This happens often.

Please add sample input and your desired output for that sample input to your question. — Cyrus, May 19 '18 at 22:55
These are double byte characters (a single character, not an extra space character). The files are tsv (tab separated values) text files. — Streamline, May 19 '18 at 23:36

Sundeep · Accepted Answer · 2018-05-20T05:37:32.793

2

If perl is okay:

$ perl -Mopen=locale -Mutf8 -pe 'tr/０-９ａ-ｚＡ-Ｚ（）【】－一/0-9a-zA-Z()[]--/' ip.txt
Part Number
123-956-AA
343-213-[E]
XTE-898-(5)

-Mopen=locale -Mutf8 to specify locale as utf8
tr/０-９ａ-ｚＡ-Ｚ（）【】－一/0-9a-zA-Z()[]--/ translate characters as required, can also use y instead of tr

sed (GNU sed) 4.2.2 can be used, but it doesn't support ranges

$ # simulating OP's POSIX locale
$ echo '91Ａ９foo' | LC_ALL=C sed 'y/Ａ９/A9/'
sed: -e expression #1, char 12: strings for `y' command are different lengths

$ # changing to a utf8 locale
$ echo '91Ａ９foo' | LC_ALL=en_US.UTF-8 sed 'y/Ａ９/A9/'
91A9foo

Further reading: https://wiki.archlinux.org/index.php/locale

edited May 20 '18 at 05:37

answered May 20 '18 at 03:44

Sundeep

23,246
2
28
103

Ah ha, adding the `LC_ALL=en_US.UTF-8` now allows me to use `y` and to use the range definitions vs individual character specification just like your example suggests! Nice. This solves this issue as originally described. One reason I wanted a `sed` solution was because I have other `sed` expressions cleaning up the file that I wanted to add this to so it would be all one command. One of those expressions is `-e 's,[\x00-\x08\x0a-\x1f\x7f]\+,,g' ` to remove control characters which now fails with the `LC_ALL=en_US.UTF-8`. I can split this up into two `sed` commands unless you have suggestion. – Streamline May 20 '18 at 04:28
@Streamline as per https://stackoverflow.com/a/36991329, try `s,[\x00\x01-\x08\x0a-\x1f\x7f]\+,,g` – Sundeep May 20 '18 at 05:00
great - initial testing does omit the error so I've got it all in one command now. I'll have to run more tests to see what the affect of that change is. – Streamline May 20 '18 at 05:13
Actually, I just realized that the double byte characters found in the range aren't getting replaced as I had thought they where in my first test of your `sed` answer. Not sure how I missed it but if you add, for example, a `Ｂ` to one of the sample input values, it doesn't get replaced with a 'B'. If I specify all characters in the range it does work. Is there something that needs to be done to get the range based transliteration working with `sed -y`? – Streamline May 20 '18 at 05:33
yeah, I too was just checking that.. because, I remember in the past I had to use perl instead of sed for some reason... sed's y doesn't support range.. – Sundeep May 20 '18 at 05:34
you could either do this with perl and do the rest with sed or convert the whole program to perl.. for ex: `tr/\x00-\x08\x0a-\x1f\x7f//d` for removing control characters – Sundeep May 20 '18 at 05:48

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

1

Use the y command:

y/source-chars/dest-chars/

Transliterate any characters in the pattern space which match any of the source-chars with the corresponding character in dest-chars.

Example: transliterate 'a-j' into '0-9':
$ echo hello world | sed 'y/abcdefghij/0123456789/'
74llo worl3
(The / characters may be uniformly replaced by any other single character within any given y command.)

Instances of the / (or whatever other character is used in its stead), \, or newlines can appear in the source-chars or dest-chars lists, provide that each instance is escaped by a \. The source-chars and dest-chars lists must contain the same number of characters (after de-escaping).

See the tr command from GNU coreutils for similar functionality.

Just keep in mind that you have to spell out each character, ranges won't work here.

So:

sed -e 'y/０１２３４５６７８９ａｂｃｄｅｆｇｈ[...]/0123456789abcdefgh[...]/'

I'll let you spell out all the other characters.

edited Jun 20 '20 at 09:12

Community

1
1

answered May 19 '18 at 23:16

1

Are you sure it works well with multibyte characters? – Casimir et Hippolyte May 19 '18 at 23:32
@CasimiretHippolyte I tested it before answering. – May 19 '18 at 23:38
Me too, but I'm not totally convinced, perhaps a wolf behind the bush (for which versions of sed it works?). – Casimir et Hippolyte May 19 '18 at 23:42
@CasimiretHippolyte Locale support, including UTF-8 support on systems that support UTF-8 locales, is required by POSIX. Any version of sed where it doesn't work is broken. – May 19 '18 at 23:47
Whenever I try using the `y` command with `sed` for this, it fails thinking the character expression is not the right length, I think because one of the characters is acting like a control character but I tried it even it just `a` such as with `sed -e 'y/ａＡ/aA/' ${file} > ${file}.tmp` and I get `sed: -e expression #1, char 12: strings for 'y'command are different lengths`. This is `sed (GNU sed) 4.2.2` on Ubuntu16.04 – Streamline May 20 '18 at 00:16
@Streamline I just checked GNU sed 4.2.2, and it supports locales just fine. Your exact command works for me with that version. Please check that you're using a correct locale (run the `locale` command). If you are, I can check an Ubuntu system soon-ish, but probably not today. – May 20 '18 at 00:28
@Streamline Right, you didn't set the locale. Although you got it working already, I'd recommend not setting the locale for this specific command, but setting your locale globally. You're using UTF-8 files, your system should be set up to process them correctly. You can use `/etc/default/locale` for this, see https://help.ubuntu.com/community/Locale. – May 20 '18 at 06:44
Thanks. This is a Docker image created Ubuntu16.04 instance and setting up things like this need to be scripted, so I will investigate and test something along these lines https://stackoverflow.com/a/28406007/56082 for setting the locale when the image is built. The core application running in this Docker container is a Ruby on Rails app and I haven't run into any locale issues like this before so the Rails app environment must be setting up the necessary locale requirements on its own and the fact that it wasn't set at the level below the Rails app hasn't come up prior to now. – Streamline May 20 '18 at 16:52

How to use sed expression for substituting double width characters with single width

2 Answers2