13

I am trying to remove non-printable characters (e.g. ^@) from the records in my file. Since the volume of records in the file is so big, using cat in a loop is not an option as it takes too much time. I tried using

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME

but still the ^@ characters are not removed. Also I tried using

awk '{ sub(/[^a-zA-Z0-9"!@#$%^&*|_\[\](){}]/, ""); print }' FILENAME > NEWFILE

but it also did not help.

Can anybody suggest some alternative way to remove non-printable characters?

I also tried tr -cd, but it removes accented characters, which are required in the file.

Pranav
  • Which language is used (unix parameter)? – NeronLeVelu Dec 22 '15 at 10:11
  • I have created a normal /bin/sh script on a unix box. The script runs over a file with 25 million records and fetches data from a db too. However, records containing junk values are being omitted by this script. – Pranav Dec 22 '15 at 10:18
  • If you're seeing a lot of NULL (0x00, \0000) characters, it might be some sort of multi-byte encoding. **If** this is the case, these are not "junk" characters. The easiest way **I** know of to check is to load the file, or some portion of it, into `emacs`. – Erik Bennett Dec 23 '15 at 22:22
  • Oop. I just found this. I **know** this will be faster than `emacs`. [Check if file contains multibyte character](http://stackoverflow.com/questions/10373258/check-if-file-contains-multibyte-character) – Erik Bennett Dec 23 '15 at 22:36

4 Answers

23

Perhaps you could go with the complement of [:print:], which contains all printable characters:

tr -cd '[:print:]' < file > newfile
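
Note that [:print:] covers spaces but not tabs or newlines, so the command above also deletes line breaks. If the line structure of the file should survive, add those characters back into the retained set:

tr -cd '[:print:]\n\t' < file > newfile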

If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):

sed 's/[^[:print:]]//g' file
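
If the locale isn't already UTF-8, you can set it just for this one command (assuming the en_US.UTF-8 locale is installed on your system):

LC_ALL=en_US.UTF-8 sed 's/[^[:print:]]//g' file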
Tom Fenech
4

Remove all control characters first:

tr -dc '\007-\011\012-\015\040-\376' < file > newfile

Then try your string:

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' newfile

I believe that the ^@ you are seeing is in fact a NUL byte (\0). The tr filter above removes those as well, since \0 falls outside the retained octal ranges.
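
To verify that the NUL bytes and other control characters are gone, one quick check is to dump the start of the result with od:

od -c newfile | head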

0

strings -1 file... > outputfile

seems to work. The strings program prints sequences of printable characters; the -1 argument lowers the minimum sequence length from the default of 4 down to 1. It effectively removes all the non-printable characters.

"man strings" will provide the documentation.

derek
0

I was searching for this for a while and found a rather simple solution:

The package ansifilter does exactly this. All you need to do is just pipe the output through it.

On Mac:

brew install ansifilter

Then:

cat file.txt | ansifilter
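
On Linux, ansifilter is generally available from the distribution package repositories too (a Debian/Ubuntu example; package name assumed to match):

sudo apt-get install ansifilter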

Jikku Jose
  • Works on Linux as well. Thanks! Other solutions didn't work for me, as I wanted to convert string `"\033[?1002l\033[?1000l\033[?1005l\033[?2004h\033[?2004l\033[?1002l\033[?1000l\033[?1005ldebconf:"` (`\033` is escape character, similar to `\e`) – L_R Jan 17 '23 at 07:30