10

I want to remove non-ASCII characters from a file. I have already tried several regexes.

sed -e 's/[\d00-\d128]//g'  # not working

cat /bin/mkdir | sed -e 's/[\x00-\x7F]//g' >/tmp/aa

But the output file still contains some non-ASCII characters:

[root@asssdsada ~]$ hexdump /tmp/aa |more
          00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF

00000000  45 4C 46 B0 F0 73 38 C0 - C0 BC BC FF FF 61 61 61  ELF..s8......aaa
00000010  A0 A0 50 E5 74 64 50 57 - 50 57 50 57 D4 D4 51 E5  ..P.tdPWPWPW..Q.
00000020  74 64 6C 69 62 36 34 6C - 64 6C 69 6E 75 78 78 38  tdlib64ldlinuxx8
00000030  36 36 34 73 6F 32 47 4E - 55 42 C8 C0 80 70 69 42  664so2GNUB...piB
00000040  44 47 BA E3 92 43 45 D5 - EC 46 E4 DE D8 71 58 B9  DG...CE..F...qX.
00000050  8D F1 EA D3 EF 4B 86 FC - A9 DA 79 ED 63 B5 51 92  .....K....y.c.Q.
00000060  BA 6C FC D1 69 78 30 ED - 74 F1 73 95 CC 85 D2 46  .l..ix0.t.s....F
00000070  A5 B4 6C 67 DA 4A E9 9A - 4B 58 77 A4 37 80 C0 4F  ..lg.J..KXw.7..O
00000080  F3 E9 B2 77 65 97 74 F9 - A2 C0 F2 CC 4A 9C 58 A1  ...we.t.....J.X.
wjandrea
user87005

5 Answers

21

This doesn't seem to work with sed. Perhaps tr will do?

tr -d '\200-\377'

Or with the complement:

tr -cd '\000-\177'
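
As a quick check of the complement form, the sketch below pipes a short UTF-8 string through tr. The helper name strip_non_ascii and the sample text are made up for illustration, and LC_ALL=C is an assumption added here so that tr works on single bytes regardless of the current locale.

```shell
# Hypothetical helper: delete every byte outside the 7-bit ASCII range.
# LC_ALL=C is an assumption (not part of the commands above); it forces
# byte-wise processing whatever the surrounding locale is.
strip_non_ascii() {
    LC_ALL=C tr -cd '\000-\177'
}

# "café naïve" encoded as UTF-8; the two-byte sequences for é and ï are dropped.
printf 'caf\303\251 na\303\257ve\n' | strip_non_ascii
```

Because tr only filters bytes, each multi-byte character disappears completely rather than being replaced by a close ASCII letter.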
Thor
8

Did you try this?

cat /bin/mkdir | tr -cd "[:print:]"

I think that solves the problem.

If only the text content interests you, you can also use strings:

cat /bin/mkdir | strings
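
One thing to watch with [:print:]: newline and tab are not printable characters, so tr -cd "[:print:]" also joins all lines into one. A minimal sketch that keeps them, assuming the C locale; the helper name keep_printable and the sample input are hypothetical.

```shell
# Hypothetical helper: keep printable ASCII plus newline and tab,
# delete everything else (control bytes and bytes >= 0x80).
# LC_ALL=C is an assumption so that [:print:] means printable ASCII only.
keep_printable() {
    LC_ALL=C tr -cd '[:print:]\n\t'
}

# Sample input: a tab, a UTF-8 "é", a newline, and a stray \001 control byte.
printf 'a\tb\303\251\nc\001d\n' | keep_printable
```

The line structure of the input survives, which plain "[:print:]" would not preserve.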
sebtic
3

Do you know what encoding the file is currently using? If so, you can use iconv to convert it. It's a utility to convert from one character encoding to another. So if the original file is in UTF-8 and you want to convert to ASCII you can use the following:

iconv -f utf8 -t ascii <inputfile>

The file command on the input file might tell you the current encoding.

Interestingly, there's a command called enca which will do its best to determine the character encoding being used if you know the language of the contents of the file.

This other question might be the answer.
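
A plain conversion to ASCII typically stops with an error at the first character that has no ASCII equivalent. The sketch below uses the -c option, which asks iconv to omit characters it cannot convert. The helper name utf8_to_ascii and the sample input are illustrative, and the input is assumed to be valid UTF-8.

```shell
# Hypothetical helper: convert UTF-8 on stdin to ASCII, dropping
# characters that have no ASCII representation (-c).
utf8_to_ascii() {
    iconv -c -f UTF-8 -t ASCII
}

# Some iconv versions exit non-zero when -c had to drop something,
# so the status is reported on stderr instead of being treated as fatal.
printf 'caf\303\251 ok\n' | utf8_to_ascii ||
    echo 'iconv exited non-zero (characters may have been dropped)' >&2
```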

Community
chooban
1

The solutions offered here did not work for me. Maybe my problem was different: I needed to strip ANSI color codes and other escape sequences from otherwise pure ASCII text.

The following worked for me, however:

Stripping Escape Codes from ASCII Text

sed -E 's/\x1b\[[0-9]*;?[0-9]+m//g'

In context (BASH):

$ printf "\e[32;1mhello\e[0m\n"
hello

$ printf "\e[32;1mhello\e[0m\n" | cat -vet
^[[32;1mhello^[[0m$

$ printf "\e[32;1mhello\e[0m\n" | sed -E 's/\x1b\[[0-9]*;?[0-9]+m//g' | cat -vet
hello$
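
The pattern above covers color sequences with one or two numeric parameters. For longer ones, such as 256-color codes, a slightly broader variant might look like the sketch below. It is an adaptation rather than the command from this answer: the helper name strip_sgr is made up, and the ESC byte is produced with printf instead of \x1b so the pattern does not rely on GNU sed escapes.

```shell
# Hypothetical helper: remove ANSI color (SGR) sequences of the form
# ESC [ <digits and semicolons> m, with any number of parameters.
strip_sgr() {
    esc=$(printf '\033')              # literal ESC byte
    sed "s/${esc}\[[0-9;]*m//g"
}

# A 256-color sequence with four parameters, then a reset.
printf '\033[1;38;5;208mhello\033[0m\n' | strip_sgr | cat -v
```

Only sequences ending in m are removed; cursor movement and other escape codes would need their own patterns.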
0

Try sed with the -i option, e.g.:

sed -i 's/[\d128-\d255]//g' MYFILE.txt

It deletes all non-ASCII characters from the file, editing it in place.
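
In a UTF-8 locale this byte range may not match as intended, because sed then works on characters rather than bytes. The sketch below assumes GNU sed (for \dNNN and -i) and adds LC_ALL=C, which is not part of the command above. The temporary file stands in for MYFILE.txt.

```shell
# Throwaway file standing in for MYFILE.txt (created only for this demo).
tmpfile=$(mktemp)
printf 'na\303\257ve caf\303\251\n' > "$tmpfile"

# LC_ALL=C (an assumption added here) makes sed treat the input as bytes,
# so the range 128-255 covers every non-ASCII byte. Requires GNU sed.
LC_ALL=C sed -i 's/[\d128-\d255]//g' "$tmpfile"

cleaned=$(cat "$tmpfile")
rm -f "$tmpfile"
printf '%s\n' "$cleaned"
```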

oz123