13

I am trying to remove non-printable characters (e.g. ^@) from the records in my file. Since the volume of records in the file is so big, using cat in a loop is not an option as it takes too much time. I tried using

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME

but still the ^@ characters are not removed. Also I tried using

awk '{ sub(/[^a-zA-Z0-9"!@#$%^&*|_\[\](){}]/, ""); print }' FILENAME > NEWFILE

but it also did not help.

Can anybody suggest some alternative way to remove non-printable characters?

I also tried tr -cd, but it removes accented characters, which are required in the file.

Pranav
  • Which language is used (unix parameter)? – NeronLeVelu Dec 22 '15 at 10:11
  • I have created a normal /bin/sh script on a unix box. The script runs over a file with 25 million records and fetches data from a db too. However, records containing junk values are being omitted by this script. – Pranav Dec 22 '15 at 10:18
  • If you're seeing a lot of NULL (0x00, \0000) characters, it might be some sort of multi-byte encoding. **If** this is the case, these are not "junk" characters. The easiest way **I** know of to check is to load the file, or some portion of it, into `emacs`. – Erik Bennett Dec 23 '15 at 22:22
  • Oop. I just found this. I **know** this will be faster than `emacs`. [Check if file contains multibyte character](http://stackoverflow.com/questions/10373258/check-if-file-contains-multibyte-character) – Erik Bennett Dec 23 '15 at 22:36

4 Answers

23

Perhaps you could go with the complement of [:print:], which contains all printable characters:

tr -cd '[:print:]' < file > newfile
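
Note that [:print:] covers spaces but not tabs or newlines, so the command above also deletes line breaks. If the line structure of the file should survive, add those characters back into the retained set:

tr -cd '[:print:]\n\t' < file > newfile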

If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):

sed 's/[^[:print:]]//g' file
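
If the locale isn't already UTF-8, you can set it just for this one command (assuming the en_US.UTF-8 locale is installed on your system):

LC_ALL=en_US.UTF-8 sed 's/[^[:print:]]//g' file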
Tom Fenech
4

Remove all control characters first:

tr -dc '\007-\011\012-\015\040-\376' < file > newfile

Then try your string:

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' newfile

I believe that the ^@ you are seeing is in fact a NUL byte (\0). The tr filter above removes those as well, since \0 falls outside the retained octal ranges.
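
To verify that the NUL bytes and other control characters are gone, one quick check is to dump the start of the result with od:

od -c newfile | head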

0

strings -1 file... > outputfile

seems to work. The strings program prints sequences of printable characters; the -1 argument lowers the minimum sequence length from the default of 4 down to 1. It effectively removes all the non-printable characters.

"man strings" will provide the documentation.

derek
0

I was searching for this for a while and found a rather simple solution:

The package ansifilter does exactly this. All you need to do is just pipe the output through it.

On Mac:

brew install ansifilter

Then:

cat file.txt | ansifilter
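
On Linux, ansifilter is generally available from the distribution package repositories too (a Debian/Ubuntu example; package name assumed to match):

sudo apt-get install ansifilter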

Jikku Jose
  • Works on Linux as well. Thanks! Other solutions didn't work for me, as I wanted to convert string `"\033[?1002l\033[?1000l\033[?1005l\033[?2004h\033[?2004l\033[?1002l\033[?1000l\033[?1005ldebconf:"` (`\033` is escape character, similar to `\e`) – L_R Jan 17 '23 at 07:30