51

How do I remove Unicode characters from a bunch of text files in the terminal?

I've tried this, but it didn't work:

sed 'g/\u'U+200E'//' -i *.txt

I need to remove these Unicode characters from the text files:

U+0091 - sort of weird "control" space
U+0092 - same sort of weird "control" space
U+00A0 - non-breaking space
U+200E - left-to-right mark
Peter Mortensen
alvas

5 Answers

68

Strip all non-ASCII characters from file.txt; either command writes the cleaned text to standard output and leaves the file itself unchanged:

$ iconv -c -f utf-8 -t ascii file.txt
$ strings file.txt

Options:

-c # discard unconvertible characters
-f # from ENCODING
-t # to ENCODING
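
For comparison, the same discard-what-does-not-convert behaviour can be sketched in Python. This is an illustration under assumptions, not part of the answer: `strip_non_ascii` is a hypothetical helper name, and it works on an in-memory string rather than on a file.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that has no ASCII encoding, similar to `iconv -c -t ascii`."""
    # errors="ignore" silently skips characters that cannot be encoded
    return text.encode("ascii", errors="ignore").decode("ascii")


# Sample containing U+00E9, U+00A0 and U+200E
sample = "caf\u00e9\u00a0au\u200e lait"
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Like `iconv -c`, this also removes legitimate accented letters, not only the invisible characters, so it only fits when pure ASCII output is acceptable.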
wisbucky
kev
55

If you want to remove only particular characters and you have Python 3, you can:

CHARS=$(python3 -c 'import sys; sys.stdout.buffer.write("\u0091\u0092\u00a0\u200e".encode("utf-8"))')
sed 's/['"$CHARS"']//g' < /tmp/utf8_input.txt > /tmp/ascii_output.txt
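
If Python is available anyway, the removal itself can also be done there instead of in sed. The following is a minimal sketch, assuming the text is already decoded to a `str`; `UNWANTED`, `DELETE_TABLE` and `remove_unwanted` are names made up for this example.

```python
# Code points named in the question: U+0091, U+0092, U+00A0, U+200E
UNWANTED = "\u0091\u0092\u00a0\u200e"

# With a third argument, str.maketrans maps each listed character to None,
# which makes translate() delete it
DELETE_TABLE = str.maketrans("", "", UNWANTED)


def remove_unwanted(text: str) -> str:
    """Delete only the listed characters; all other text is left untouched."""
    return text.translate(DELETE_TABLE)


sample = "na\u00efve\u00a0text\u200e with \u0091quotes\u0092"
result = remove_unwanted(sample)
print(result)
```

Unlike an ASCII conversion, this keeps other non-ASCII characters such as "ï" intact, and the list of characters stays in one place, much like the CHARS variable.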
Michał Šrajer
  • Maybe not the prettiest, but it worked very well for me. Constructing the CHARS variable makes the sed command easier to read, and the variable is easy to maintain. Choroba's answer also works, so I guess it's a matter of taste (and whether you have Python handy). – Paulb Feb 17 '14 at 13:03
  • An alternative for the Python part (Python 2 syntax): `python -c 'print "".join(map(unichr, range(0x80, 0xa0) + range(0x2000, 0x200f))).encode("utf-8")'` – ENDOH takanao Mar 17 '15 at 04:15
  • In recent Linux distributions you can type Unicode characters by pressing Ctrl+Shift+u followed by the hexadecimal code and Enter, e.g. `Ctrl+Shift+u 0019 ⏎` – smoebody Apr 26 '16 at 11:01
  • Is it faster to do an in-place edit, if all the text is separated by newlines, than to use `< path > newpath`? I have a massive file, which is why I ask. – Joshua Robinson Sep 20 '16 at 10:28
  • The comment by kev on choroba's answer is what I found most useful. You can combine it with this answer to get `CHARS=$(echo -ne '\u200c')` followed by the same `sed` line. – Hrishikesh Feb 17 '18 at 14:21
35

For UTF-8 encoding of Unicode, you can use this regular expression for sed:

sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g'
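
The byte sequences in the expression above are the UTF-8 encodings of the four code points from the question. A small sketch of how such a mapping can be derived, assuming Python 3 is at hand; `utf8_escapes` is a hypothetical helper name.

```python
def utf8_escapes(code_point: int) -> str:
    """Return the UTF-8 bytes of a code point written as \\xNN escapes."""
    encoded = chr(code_point).encode("utf-8")
    # Format each byte as a two-digit lowercase hex escape
    return "".join(f"\\x{byte:02x}" for byte in encoded)


# The four code points from the question
for cp in (0x0091, 0x0092, 0x00A0, 0x200E):
    print(f"U+{cp:04X} -> {utf8_escapes(cp)}")
```

For the four code points this prints the same sequences that appear in the sed expression, for example `\xe2\x80\x8e` for U+200E.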
choroba
  • How do I find the mapping from `U+...` to `\xc2\...`? – alvas Dec 19 '11 at 14:37
  • The | doesn't work for me this way in sed, so I had to string a series of sed commands with single replaces together. – Jonathan W. Oct 27 '19 at 01:41
  • @JonathanW. Wasn't it rather the missing `/g`? – choroba Oct 27 '19 at 08:47
  • There are quite a few differences between systems here. macOS doesn't support the \xNN codes, and RHEL requires the -r option for sed to be able to use them. Just something to keep in mind in case you're developing a script on one system and deploying to another (generally not the best idea, but that's never prevented people from doing so) :) – Joe Dyndale Sep 15 '20 at 10:33
  • @JonathanW. maybe you want to add `-e` to the sed command in order to use pipes within the regex – OldFart May 18 '22 at 13:10
  • @OldFart, I believe `-r` would have worked, as @JoeDyndale mentioned. – Jonathan W. May 19 '22 at 16:20
16

Use iconv:

iconv -f utf8 -t ascii//TRANSLIT < /tmp/utf8_input.txt > /tmp/ascii_output.txt

This will transliterate characters like "Š" into "S" (the most similar-looking ASCII characters).
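
A rough Python approximation of this kind of transliteration is sketched below. It is not how iconv implements `//TRANSLIT`, and the results can differ; it only strips accents by Unicode decomposition. `rough_transliterate` is a made-up name.

```python
import unicodedata


def rough_transliterate(text: str) -> str:
    """Approximate ASCII transliteration by stripping combining marks."""
    # NFKD splits "Š" into "S" plus a combining caron (U+030C)
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" then drops the combining marks and
    # anything else that has no ASCII form
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(rough_transliterate("Micha\u0142 \u0160rajer"))
```

Characters without a decomposition, such as "ł", are dropped rather than replaced, so this sketch is lossier than a real transliteration table.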

Michał Šrajer
  • They are not ASCII; I want to keep them in UTF-8, but I want to replace these weird spaces with an empty string `""`. – alvas Dec 19 '11 at 14:09
  • Not what the OP wanted, but I had a need to convert a Unicode line separator (U+2028) into a newline. I would have preferred to use iconv, but I couldn't figure out how to do it. Is there a way? – Chris Quenelle Oct 01 '13 at 18:05
  • The -c flag is useful to discard characters that cannot be transliterated, avoiding a fatal error. – Eric Bréchemier Sep 08 '14 at 09:10
  • As an alternative to -c, --unicode-subst lets you specify a replacement pattern for the character instead of removing it completely. For example, --unicode-subst='?' replaces unconvertible characters with a question mark. – Eric Bréchemier Sep 08 '14 at 10:31
  • @ChrisQuenelle - it's years later, but did you ever solve your problem? I have the same issue. – JBCP Mar 12 '15 at 19:40
  • It's been so long, I don't recall how I solved my problem. I think I got iconv to do what I wanted. – Chris Quenelle Mar 16 '15 at 21:35
  • scrolling down, this was the first answer that worked. – Jeffrey Sep 21 '19 at 14:31
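
The line-separator question in the comments above was left open for iconv. Outside iconv, a plain string replacement is one way to do it; the sketch below assumes the text is already decoded, and `line_separators_to_newlines` is a hypothetical name.

```python
LINE_SEPARATOR = "\u2028"  # Unicode LINE SEPARATOR


def line_separators_to_newlines(text: str) -> str:
    """Replace every U+2028 with an ordinary newline."""
    return text.replace(LINE_SEPARATOR, "\n")


converted = line_separators_to_newlines("first\u2028second\u2028third")
# Show the resulting lines as a list
print(converted.split("\n"))
```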
2

Convert Swift files from UTF-8 to ASCII:

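for file in *.swift; do
    # -c drops characters that have no ASCII equivalent; the original
    # is replaced only if the conversion succeeded
    iconv -c -f utf-8 -t ascii "$file" > "$file.tmp" && mv -f "$file.tmp" "$file"
done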

Swift auto completion not working in Xcode 6 Beta
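
The write-to-a-temporary-file-then-rename pattern from the loop above can also be sketched in Python. This is an illustration under assumptions: `rewrite_as_ascii` and the file name `Example.swift` are made up, and the demo runs in a scratch directory rather than on real sources.

```python
import os
import tempfile
from pathlib import Path


def rewrite_as_ascii(path: Path) -> None:
    """Rewrite one UTF-8 file as ASCII, dropping characters that do not fit."""
    # Work on bytes so that line endings are preserved exactly
    data = path.read_bytes().decode("utf-8").encode("ascii", errors="ignore")
    # Write to a temporary file in the same directory, then rename it over
    # the original so a failure never leaves a half-written file behind
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    os.replace(tmp_name, path)


# Demo in a scratch directory with a made-up file
with tempfile.TemporaryDirectory() as scratch:
    demo = Path(scratch) / "Example.swift"
    demo.write_bytes('let s = "caf\u00e9\u200e"\n'.encode("utf-8"))
    for swift_file in Path(scratch).glob("*.swift"):
        rewrite_as_ascii(swift_file)
    print(demo.read_bytes())
```

Note that `mkstemp` creates the temporary file with restrictive permissions, so the rewritten file may not keep the mode of the original.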

ma11hew28