I am trying to manipulate a text file and remove non-ASCII characters from the text. I don't want to remove the line. I only want to remove the offending characters. I am trying to get the following expression to work:

sed '/[\x80-\xFF]/d'

dda
M_x_r
  • See [this](http://stackoverflow.com/questions/3337936/remove-non-ascii-characters-from-csv) answer. – speakr Feb 22 '13 at 23:38
  • This thread might have the answer you are looking for: http://stackoverflow.com/questions/8571601/skip-remove-non-ascii-character-with-sed – Ifthikhan Feb 22 '13 at 23:38
  • Your command will delete all lines containing non-ascii characters. If that's not what you want, check the duplicate questions – Chris Dodd Feb 23 '13 at 00:02
  • I have tried two commands. 1) sed -E 's/[^[:print:]]//' should remove non-printable characters, but non-printable characters still appear. 2) When I try sed -E 's/[\d128-\d255]//', I get an Invalid Collation error. Are there any other commands that someone can suggest to remove only the non-ASCII characters? – M_x_r Feb 23 '13 at 00:15
  • There is a decent Perl example in the first comment's link, if that is what you mean by "any other commands"... – Josh Feb 23 '13 at 00:29
  • Thanks Josh but I am looking to do it with Sed or maybe TR – M_x_r Feb 23 '13 at 00:32

1 Answer

The suggested solutions may fail with specific versions of sed, e.g. GNU sed 4.2.1.

Using tr:

tr -cd '[:print:]' < yourfile.txt

This will remove any characters not in [\x20-\x7e], including line feeds, so the output ends up on one continuous line.

If you want to keep e.g. line feeds, just add \n:

tr -cd '[:print:]\n' < yourfile.txt
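
As a quick check of the command above, here is a minimal sketch that runs it on two UTF-8 words. The `strip_non_ascii` function name and the sample input are made up for illustration, and the `LC_ALL=C` prefix is an added assumption so that `[:print:]` is evaluated byte-wise even on systems where tr is locale-aware.

```shell
# Hypothetical helper: keep printable ASCII (0x20-0x7e) and line feeds,
# drop every other byte read from stdin.
strip_non_ascii() {
  # LC_ALL=C is an assumption added here: it forces the C locale so that
  # [:print:] cannot match non-ASCII characters.
  LC_ALL=C tr -cd '[:print:]\n'
}

# Sample input: "café" and "naïve" encoded as UTF-8, one word per line
# (the accented letters are written as octal byte escapes).
printf 'caf\303\251\nna\303\257ve\n' | strip_non_ascii
```

The accented letters are dropped and the line structure is kept. Since tr only reads stdin and writes stdout, redirect the result to a new file rather than back onto the input file.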

If you really want to keep all ASCII characters (even the control codes):

tr -cd '[:print:][:cntrl:]' < yourfile.txt

This will remove any characters not in [\x00-\x7f].
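
The same set can also be written as an explicit octal byte range instead of character classes. This is a minimal sketch, assuming a tr that accepts `\NNN` octal escapes in ranges as POSIX describes; `keep_ascii` and the sample input are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: keep every 7-bit ASCII byte (octal 000-177),
# control codes included, and delete all bytes from 0x80 upward.
keep_ascii() {
  # LC_ALL=C is an assumption added here so the range is compared byte-wise.
  LC_ALL=C tr -cd '\000-\177'
}

# Sample input: a tab, a UTF-8 "é" (two non-ASCII bytes) and a line feed.
# The tab and the line feed survive; only the two bytes of "é" are removed.
printf 'a\tb\303\251\n' | keep_ascii
```

Spelling out the range makes it obvious which bytes survive, at the cost of being less readable than the class names.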

speakr
  • Hey speakr, is there a way to preserve the format of the text file? The tr command feeds everything onto a continuous line, right? – M_x_r Feb 23 '13 at 00:39
  • @bosra: I added an example to preserve line feeds. – speakr Feb 23 '13 at 00:44
  • Man, if I could upvote this a few more times I would..Thanks – M_x_r Feb 23 '13 at 21:18
  • any idea why meld would still consider the fixed files as binary? btw, the result seems different from `tr -cd '\11\12\15\40-\176'` which worked with meld (at least with my files) [ref](http://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix) – Aquarius Power Mar 01 '15 at 19:45
  • This question helped me a lot, but since I wanted to keep the \n and \t in the output file, I used the command below instead: tr -cd '[:print:]\n\t' < yourfile.txt > output.txt – ccoutinho May 15 '15 at 13:10