I want to replace certain double-width characters found in a file with their single-width equivalents using a sed expression. The command below does not work as expected, but it expresses what I want to do (this is in a bash script). I have mixed alphanumeric ranges with a few other characters I can think of offhand; I am not sure whether this needs to be split into two different -e arguments depending on whether ranges are involved.
sed -e 's,[0-9a-zA-Z()【】-一],[0-9a-zA-Z\(\)\[\]\-\-],g' ${file} > ${file}.cleaned
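A minimal illustration of one thing that goes wrong with this attempt: sed inserts the right-hand side of s literally, so a bracket expression there is not treated as a range of replacement characters. The input below is plain ASCII and chosen only for demonstration:

```shell
# The replacement part of s is literal text, so [0-9] is copied as-is
# instead of being mapped position by position onto the matched character.
printf 'a1\n' | sed 's,[0-9],[0-9],g'
# prints: a[0-9]
```

This is why a character-by-character mapping needs a different command than a single s with two bracket expressions.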
The files are tsv (tab separated values) text files.
According to the file command, the type is: UTF-8 Unicode text, with CRLF line terminators or (in another case) UTF-8 Unicode text, with no line terminators
Sample input:
Part Number
123-956-AA
343-213-【E】
XTE-898一(5)
Sample output:
Part Number
123-956-AA
343-213-[E]
XTE-898-(5)
My system is Ubuntu 16.04 running in a Docker container built from our base image, which is built from phusion/passenger-ruby23:0.9.19 (whose base image is ultimately ubuntu:16.04). The shell is GNU bash, version 4.3.46(1)-release (x86_64-pc-linux-gnu), the sed version is sed (GNU sed) 4.2.2, and the output of the locale command is:
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
Update:
The chosen solution was 1) to use the y command (as the other answers also suggested) and, in my case, 2) to set LC_ALL as shown below to avoid the error I was getting with the y command. Ranges do not appear to work with the y command, so all characters must be identified individually (as I previously mistakenly thought).
LC_ALL=en_US.UTF-8 sed 'y/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()【】-一/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()[]--/' file.tsv
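The need for a UTF-8 locale is consistent with sed counting bytes rather than characters under the POSIX locale: each of the three non-ASCII characters here occupies three bytes in UTF-8, so the two y strings would no longer be the same length. A quick check, assuming the script itself is saved as UTF-8 (byte_len is a hypothetical helper name, not part of any tool):

```shell
# Hypothetical helper: print the number of bytes in its argument.
# wc -c counts bytes regardless of the active locale; tr strips any
# padding that some wc implementations add.
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

for ch in '【' '】' '一' '-'; do
  printf '%s is %s byte(s)\n' "$ch" "$(byte_len "$ch")"
done
```

The three wide characters report 3 bytes each and the ASCII hyphen reports 1, which is the mismatch a byte-oriented locale would see.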
Update 2:
Per the suggestion from the other answerers (one of whose answers has mysteriously vanished), I investigated setting the locale for the whole system instead of setting the environment variable on the command line. Since this runs in a Docker container, I found a solution to put into our base image, which solves the problem at the base system level.
I've added to our base Dockerfile:
# Set the locale
RUN locale-gen en_US.UTF-8
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'
and now the locale command produces:
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
and now the sed command works as follows:
sed 'y/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()【】-一/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890()[]--/' file.tsv
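Where the locale cannot be changed, one option that seems to sidestep the issue is a separate s command per character, since s with a literal pattern matches the same byte sequence whether or not the locale is UTF-8. A minimal sketch, not the accepted approach, covering only the three wide characters from the sample data (normalize_width is a hypothetical helper name):

```shell
# Hypothetical helper: replace each wide character with its ASCII
# counterpart using one literal s command per character. Reads stdin,
# writes stdout. Extend the list with further -e expressions as needed.
normalize_width() {
  sed -e 's/【/[/g' -e 's/】/]/g' -e 's/一/-/g'
}

printf '343-213-【E】\nXTE-898一(5)\n' | normalize_width
# prints:
# 343-213-[E]
# XTE-898-(5)
```

The trade-off is one expression per character instead of a single y mapping, which gets verbose for large character sets.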
As a side note, I wish Stack Overflow provided a way to give answer credit to multiple answers, since the original three answers (again, one vanished) all contributed to my getting to the solution, but I had to choose only one. This happens often.