How to replace Unicode characters with ASCII

Question

I have the following command to replace Unicode characters with ASCII ones.

sed -i 's/Ã/A/g'

The problem is that Ã isn't recognized by the sed command in my Unix environment, so I assume I need to replace it with its hexadecimal value. What would the syntax look like if I were to use C3 instead?

I'm using this command as a template for other characters I'd like to replace with blank spaces, such as:

sed -i 's/©/ /g'

you mean like this? http://stackoverflow.com/questions/22450563/sed-matching-unicode-blocks-with — Leo, Nov 21 '14 at 00:35
What character set does your terminal use? And what encoding does the input text use? Ã in UTF-8 is 0xC3 0x83, and character 0x83 is a control code in ISO 8859-1, so that might be a problem. I suppose you can’t just set `LANG=en_US.UTF-8` on your system. — yellowantphil, Nov 21 '14 at 03:03

ajaaskel · Answer 1 · 2014-11-21T07:56:21.507

18

It is possible to use hex values in "sed".

echo "Ã" | hexdump -C
00000000  c3 83 0a                                          |...|
00000003

OK, that character is the two-byte combination "c3 83". Let's replace it with the single byte "A":

echo "Ã" |sed 's/\xc3\x83/A/g'
A

Explanation: \x tells "sed" that a hex code follows (this escape is a GNU sed extension).
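The same approach can be applied to any other character, such as the © from the question. The following is only a sketch: hex_escape is a hypothetical helper name, it assumes the script itself is saved as UTF-8, and the \xHH escapes it produces assume GNU sed.

```shell
# Hypothetical helper: print the bytes of its argument as \xHH escapes.
hex_escape() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}

# Build the pattern for the copyright sign (bytes c2 a9 in UTF-8).
pattern=$(hex_escape '©')
printf '%s\n' "$pattern"

# Replace every occurrence with a space; \xHH in the pattern needs GNU sed.
printf 'Copyright © 2014\n' | LC_ALL=C sed "s/$pattern/ /g"
```

Running sed under LC_ALL=C makes it treat the input as plain bytes, so the match does not depend on the terminal's character set.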

edited Nov 21 '14 at 07:56

answered Nov 21 '14 at 07:41


Usually I would write those with <<<, but piping gives the average reader a better idea of what's going on. – ajaaskel Nov 21 '14 at 07:43
What do you mean "write them with <<<"? – isomorphismes Dec 28 '15 at 02:14
In case you were wondering what the `0a` in the hexdump was, it is the `LF` character from the `echo`. That's why it's ignored. Or you could use `echo -n` to not print the `LF`. – wisbucky May 05 '16 at 17:55
I had to pass all the three parts (not two) to sed to successfully replace 'e2 80 af' character. Can this be a general rule? – ka3ak Dec 24 '17 at 07:51
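On the last comment: UTF-8 encodes a character in one to four bytes, so the pattern has to list every byte of the character, however many there are. A minimal sketch for the three-byte sequence e2 80 af (U+202F, NARROW NO-BREAK SPACE), assuming GNU sed; replace_nnbsp is a name made up for this example.

```shell
# Replace the three-byte sequence e2 80 af with a plain ASCII space.
# All three bytes must appear in the pattern; \xHH needs GNU sed.
replace_nnbsp() {
  LC_ALL=C sed 's/\xe2\x80\xaf/ /g'
}

# printf writes the same three bytes using portable octal escapes.
printf 'a\342\200\257b\n' | replace_nnbsp
```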

score 8 · Answer 2 · answered Nov 21 '14 at 00:36

You can use iconv:

iconv -f utf-8 -t ascii//translit
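A rough sketch of how that command might sit in a pipeline. It assumes an iconv that supports the //TRANSLIT suffix (GNU iconv does), and to_ascii is a name made up for this example. What each character turns into depends on the iconv implementation and the current locale; characters without an ASCII equivalent may come out as "?".

```shell
# Transliterate UTF-8 input on stdin to ASCII on stdout.
# Assumes an iconv that understands the //TRANSLIT suffix.
to_ascii() {
  iconv -f UTF-8 -t ASCII//TRANSLIT
}

# Sample input containing é (c3 a9) and © (c2 a9), written as octal escapes.
printf 'caf\303\251 \302\251 2014\n' | to_ascii
```

Unlike the sed approach, this converts every non-ASCII character in one pass, so it is a poor fit when only specific characters should be replaced with spaces.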

midori

You mean GNU iconv. Not all versions of iconv support transliteration. – Nov 21 '14 at 00:42
Yes, but he can give it a try – midori Nov 21 '14 at 00:44
Thanks, but I'm using this as a template to create other sed commands that will replace certain characters with blank spaces, for example: sed -i 's/©/ /g' – Sandeep Johal Nov 21 '14 at 00:45

score 8 · Answer 3 · answered Nov 12 '15 at 15:27

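Try setting LANG=C so that sed works on single bytes, and then match the non-ASCII byte range 0x80-0xFF (every byte of a multibyte UTF-8 character falls in that range). Note that LC_ALL, if set, overrides LANG: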
echo "hi ☠ there ☠" | LANG=C sed "s/[\x80-\xFF]//g"
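If the sed at hand does not understand \xHH escapes, the same byte range can be deleted with tr, which accepts octal escapes. This is a different tool from the one in the answer, shown only as a sketch; strip_non_ascii is a name made up for this example.

```shell
# Delete every byte in the range 0x80-0xFF (octal 200-377).
# LC_ALL=C makes tr operate on single bytes rather than characters.
strip_non_ascii() {
  LC_ALL=C tr -d '\200-\377'
}

# ☠ is the three bytes e2 98 a0 in UTF-8 (octal 342 230 240).
printf 'hi \342\230\240 there \342\230\240\n' | strip_non_ascii
```

The spaces around each removed character are left in place, so the output keeps a double space where the first ☠ was.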

score 4 · Answer 4 · edited Dec 04 '15 at 08:09

There is also uconv, from ICU.

Examples:

uconv -x "::NFD; [:Nonspacing Mark:] > ; ::NFC;": removes accents
uconv -x "::Latin; ::Latin-ASCII;": transliterates to Latin, then to ASCII
uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;": transliterates to Latin, then to ASCII, and removes any remaining code points > 0x7F
...

echo "À l'école ☠" | uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;" gives: A l'ecole
