Remove Unicode characters from text files - sed, other Bash/shell methods

Question

How do I remove Unicode characters from a bunch of text files in the terminal?

I've tried this, but it didn't work:

sed 'g/\u'U+200E'//' -i *.txt

I need to remove these Unicode characters from the text files:

U+0091 - sort of weird "control" space
U+0092 - same sort of weird "control" space
U+00A0 - non-breaking space
U+200E - left to right mark

What encoding are your text files in? – unwind Dec 19 '11 at 14:08

score 68 · Answer 1 · edited Sep 21 '22 at 21:17

Clear all non-ASCII characters of file.txt:

$ iconv -c -f utf-8 -t ascii file.txt
$ strings file.txt

Options:

-c # discard unconvertible characters
-f # from ENCODING
-t # to ENCODING
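
If the goal really is to drop every non-ASCII character, a byte-level filter is another option. The sketch below swaps `iconv` for `tr`, which is a different tool from the one used in this answer, and it assumes the input is UTF-8. The function name `strip_non_ascii` is made up for illustration.

```shell
# strip_non_ascii: filter stdin to stdout, deleting every byte in 0x80-0xFF.
# In UTF-8, all bytes of a non-ASCII character fall in that range, so whole
# characters disappear and plain ASCII is left untouched.
strip_non_ascii() {
    LC_ALL=C tr -d '\200-\377'
}

# Example: "a", U+00A0, "b", U+200E, "c" (bytes written as octal escapes)
printf 'a\302\240b\342\200\216c\n' | strip_non_ascii
```

Like `iconv -c`, this removes all non-ASCII text, not just the four characters from the question.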

edited by wisbucky · answered Dec 19 '11 at 14:12 by kev

i want to keep the unicode encoding. sorry, so iconv is not the solution. – alvas Dec 19 '11 at 14:40
Why can't you just run it in reverse? `tempf=$(mktemp); iconv -c -f utf-8 -t ascii file.txt > "$tempf"; iconv -f ascii -t utf-8 "$tempf" > file.txt` – David Gladfelter Feb 21 '14 at 16:32
ASCII is a valid subset of UTF-8. The reverse transformation keeps the file unchanged. – Eric Bréchemier Sep 08 '14 at 09:13
You have just changed my life, kev! You're The Man. Thanks! – Krzysztof Jabłoński Oct 03 '14 at 15:17
This was it for me. Was breaking my automation with this nonsense. Now it works again! – rylectro Jul 02 '20 at 04:00
@alvas what about `iconv -c -f utf-8 -t utf-8 file.txt` – kawu Apr 25 '23 at 08:18

score 55 · Accepted Answer · edited Aug 29 '20 at 12:09

If you want to remove only particular characters and you have Python, you can:

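# Build the UTF-8 bytes of the unwanted characters (Python 3)
CHARS=$(python3 -c 'import sys; sys.stdout.buffer.write("\u0091\u0092\u00a0\u200e".encode("utf-8"))')
# Delete every occurrence of those characters; sed must run in a UTF-8 locale
sed 's/['"$CHARS"']//g' < /tmp/utf8_input.txt > /tmp/ascii_output.txt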

edited by Peter Mortensen · answered Dec 19 '11 at 14:19 by Michał Šrajer

Maybe not the prettiest. But it worked very well for me. By constructing the CHARS variable, it made the sed easier to read, and CHARS variable can be easily maintained. Choroba's answer also works, so I guess it's a matter of taste (and if you have Python handy). – Paulb Feb 17 '14 at 13:03
An alternative for the Python part: `python -c 'print "".join(map(unichr, range(0x80, 0xa0) + range(0x2000, 0x200f))).encode("utf-8")'` – ENDOH takanao Mar 17 '15 at 04:15
in recent Linux OSes you can write Unicode characters by pressing Ctrl+Shift+u followed by the numeric code and Enter, e.g. `Ctrl+Shift+u 0019 ⏎` – smoebody Apr 26 '16 at 11:01
Is it faster to do an in-place edit if all the text is separated by newlines than using `< path > newpath`? I have a massive file, which is why I ask. – Joshua Robinson Sep 20 '16 at 10:28
Comment by kev on choroba's answer is what I found most useful. You can plug that with this answer to get `CHARS=$(echo -ne '\u200c')` followed by the same `sed` line. – Hrishikesh Feb 17 '18 at 14:21

score 35 · Answer 3 · edited Aug 29 '20 at 12:04

For UTF-8 encoding of Unicode, you can use this regular expression for sed:

sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g'
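
One of the comments on this answer reports that `\xNN` escapes and `\|` behave differently across sed implementations. A possible workaround, sketched here under the assumption that the input is UTF-8, is to let `printf` produce the raw bytes and give sed one plain substitution per sequence. The variable names and the function name `strip_marks` are hypothetical.

```shell
# Raw UTF-8 byte sequences, written as octal escapes so that printf, not sed,
# has to understand them.
c1_91=$(printf '\302\221')      # U+0091
c1_92=$(printf '\302\222')      # U+0092
nbsp=$(printf '\302\240')       # U+00A0 NO-BREAK SPACE
lrm=$(printf '\342\200\216')    # U+200E LEFT-TO-RIGHT MARK

# strip_marks: filter stdin to stdout, deleting the four sequences above.
# LC_ALL=C makes sed compare plain bytes, so no UTF-8 locale is required.
strip_marks() {
    LC_ALL=C sed -e "s/$c1_91//g" -e "s/$c1_92//g" -e "s/$nbsp//g" -e "s/$lrm//g"
}
```

Usage would look like `strip_marks < /tmp/utf8_input.txt > /tmp/utf8_output.txt`. Matching byte sequences is safe in valid UTF-8 because a lead byte followed by its continuation bytes can only ever encode that one character, so other non-ASCII text is left alone.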

edited by Peter Mortensen · answered Dec 19 '11 at 14:26 by choroba

how do i find the mapping from `U+...` to `\xc2\...` ? – alvas Dec 19 '11 at 14:37
The | doesn't work for me this way in sed, so I had to string a series of sed commands with single replaces together. – Jonathan W. Oct 27 '19 at 01:41
@JonathanW. Wasn't it rather the missing `/g`? – choroba Oct 27 '19 at 08:47
There are quite a few differences between systems here. MacOS doesn't support the \xNN codes and RHEL requires the use of the -r option for sed to be able to use them. Just something to keep in mind in case you're developing a script on one system and deploying to another (generally not the best idea, but that's never prevented people from doing so) :) – Joe Dyndale Sep 15 '20 at 10:33
@JonathanW. maybe you want to add `-e` to the sed command in order to use pipes as within regex – OldFart May 18 '22 at 13:10
@OldFart, I believe `-r` would have worked, as @JoeDyndale mentioned. – Jonathan W. May 19 '22 at 16:20
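
On the question above of how `U+...` maps to `\xc2\...`: the bytes follow from the standard UTF-8 encoding rules, which can be worked out with shell arithmetic. This is only a sketch; the function name `utf8_escape` and the variable `cp` are made up, and surrogate code points (U+D800 to U+DFFF) are not rejected.

```shell
# utf8_escape HEX: print the UTF-8 encoding of code point U+HEX as \xNN escapes.
utf8_escape() {
    cp=$((0x$1))
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        printf '\\x%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '\\x%02x\\x%02x\n' \
            $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '\\x%02x\\x%02x\\x%02x\n' \
            $((0xE0 | (cp >> 12))) $((0x80 | ((cp >> 6) & 0x3F))) \
            $((0x80 | (cp & 0x3F)))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '\\x%02x\\x%02x\\x%02x\\x%02x\n' \
            $((0xF0 | (cp >> 18))) $((0x80 | ((cp >> 12) & 0x3F))) \
            $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
    fi
}

utf8_escape 200E    # prints \xe2\x80\x8e
utf8_escape 00A0    # prints \xc2\xa0
```

The two results match the sequences used in the sed expression of this answer.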

score 16 · Answer 4 · edited Aug 29 '20 at 12:08

Use iconv:

iconv -f utf8 -t ascii//TRANSLIT < /tmp/utf8_input.txt > /tmp/ascii_output.txt

This will translate characters like "Š" into "S" (most similar looking ones).

edited by Peter Mortensen · answered Dec 19 '11 at 14:05 by Michał Šrajer

they are not ascii, i want to keep them in utf8 but i want to replace these weird spaces into normal null string `""` – alvas Dec 19 '11 at 14:09
Not what the OP wanted, but I had a need to convert a Unicode line separator (U+2028) into a newline. I would have preferred to use iconv, but I couldn't figure out how to do it. Is there a way? – Chris Quenelle Oct 01 '13 at 18:05
the -c flag is useful to discard characters that cannot be transliterated, avoiding a fatal error. – Eric Bréchemier Sep 08 '14 at 09:10
As an alternative to -c, --unicode-subst lets you specify a pattern to substitute for the character instead of removing it completely. For example, --unicode-subst='?' replaces non-identifiable characters with a question mark. – Eric Bréchemier Sep 08 '14 at 10:31
@ChrisQuenelle - it's years later, but did you ever solve your problem? I have the same issue. – JBCP Mar 12 '15 at 19:40
It's been so long, I don't recall how I solved my problem. I think I got iconv to do what I wanted. – Chris Quenelle Mar 16 '15 at 21:35
scrolling down, this was the first answer that worked. – Jeffrey Sep 21 '19 at 14:31

score 2 · Answer 5 · edited Aug 29 '20 at 12:03

Convert Swift files from UTF-8 to ASCII:

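for file in *.swift; do
    # Replace the original only if the conversion succeeded; iconv stops at
    # the first character it cannot convert unless -c is given.
    iconv -f utf-8 -t ascii "$file" > "$file".tmp && mv -f "$file".tmp "$file"
done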

Swift auto completion not working in Xcode 6 Beta

edited by Peter Mortensen · answered Jul 12 '14 at 13:56 by ma11hew28
