69

I want to remove all the non-ASCII characters from a file in place.

I found one solution with tr, but I guess I need to write back that file after modification.

I need to do it in place with relatively good performance.

Any suggestions?

dda
Sujit
  • can you provide a link to the one liner with tr? – Jordan Sitkin Jun 28 '16 at 19:00
  • The OP probably(?) meant non-printable characters (ctrl-c, Unicode code point U+0003, is an ASCII character). The question should also specify the locale; without that information one could (should?) assume he meant the "C" locale. A naive answer would be to strip any byte greater than 0x7f. That would preserve characters that are not printable in the C locale but are perfectly legitimate ASCII characters. I'm downvoting the question because these points make it too vague. – Juan Mar 07 '18 at 00:58
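
Whichever definition is meant, it helps to measure the file before and after stripping. The following is a minimal sketch that counts bytes above 0x7f; `count_non_ascii` is a hypothetical helper name, not a standard tool, and the sample text is made up for illustration.

```shell
# Count bytes outside 7-bit ASCII (0x80-0xFF) in a file.
# count_non_ascii is a hypothetical helper name.
count_non_ascii() {
    # tr -d deletes every ASCII byte (octal 000-177); wc -c counts what is left.
    # The trailing tr removes the padding some wc implementations print.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

sample=$(mktemp)
printf 'caf\303\251 \001ok\n' > "$sample"   # "café" in UTF-8 plus one control byte
count_non_ascii "$sample"                   # prints: 2 (é is two bytes in UTF-8)
rm -f "$sample"
```

Note that the control byte \001 is not counted: it is non-printable but still ASCII, which is exactly the ambiguity raised in the comment above.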

11 Answers

87

A perl oneliner would do: perl -i.bak -pe 's/[^[:ascii:]]//g' <your file>

-i says that the file is going to be edited in place, and the backup is going to be saved with the extension .bak.

ssegvic
  • 1
    This one is also usable with `stdin` as input. – h3xStream Aug 08 '12 at 14:59
  • 3
    The perl solution is faster than the sed solution. Trying to update a 122 GB file using sed took 3 hours, while perl took less than 2 hours for me. – user8128167 Sep 15 '14 at 19:01
  • I couldn't get the `sed` solution to work in my environment (Ubuntu gnu sed 4.2.2) but this worked like a charm. – steve klein Jun 01 '15 at 12:02
  • 1
    Tried everything and this was the only one that worked for me. Gotta love the power of Perl. Thanks! – jbrahy Dec 20 '16 at 19:00
  • However, when attempting to replace a non-ASCII character with, say, '?', '??' comes out. I speculate that perl replaces each of the two bytes of the Unicode character, thus one '?' per byte: `echo "é" | perl -pe 's/[^[:ascii:]]/?/g'` prints `??`. – Hans Deragon Apr 24 '23 at 13:09
51
# -i (inplace)

sed -i 's/[\d128-\d255]//g' FILENAME
JellicleCat
Ivan
  • @Sujit: Note that `sed -i` still creates an intermediate file. It just does it behind the scenes. – Dennis Williamson Jul 26 '10 at 19:57
  • @Dennis - then what would be the better solution? – Sujit Jul 26 '10 at 20:43
  • 4
    @Sujit: There's not a better solution. I just wanted to point out that an intermediate file is still created. Sometimes that matters. I just didn't want you to be under the assumption that it was doing it *literally* in place. – Dennis Williamson Jul 26 '10 at 21:22
  • On MacOSX, `sed: 1: "FILENAME": unterminated substitute pattern` – h3xStream Aug 08 '12 at 15:01
  • `sed -i "s/[\d128-\d255]//g" FILE` works for me on centos w/ GNU sed. You may have to use different quoting strategy (double quotes instead of single) depending on your OS/shell. – Joe Atzberger Aug 09 '13 at 16:27
  • 60
    Prints "Invalid collation character" on GNU sed 4.2.1. – Jason C Jun 18 '14 at 15:16
  • 31
    I can avoid the "invalid collation character" error with `LANG=C sed -i 's/[\d128-\d255]//g' FILE` – Patrick Dec 30 '14 at 21:58
  • 1
    @Patrick then your setup is broken. C locale implies 7-bit characters, and should generate that error with that pattern space. I recommend using a locale that has 8-bit characters, like iso-8859-1. That worked for me. – MarkI Jan 26 '15 at 18:39
  • On cygwin I got the same problem as @JasonC and Patrick's solution didn't fix it for me. I used the Perl solution below. – skiphoppy Nov 10 '16 at 21:15
  • @skiphoppy Try using double backslashes with cygwin. [related discussion](https://superuser.com/questions/552041/why-is-it-true-that-three-backslashes-are-needed-on-windows-for-sed-replace) – C8H10N4O2 Jan 19 '18 at 16:47
  • 2
    I fixed the "Invalid collation character" error by prefixing the sed invocation with `LC_ALL=C`. – Diomidis Spinellis Jan 02 '21 at 12:16
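
Two points from these comments can be checked together: that the `LC_ALL=C` prefix lets the pattern run, and that `sed -i` replaces the file rather than editing it literally in place. This sketch assumes GNU sed (both `-i` and `\dNNN` are GNU extensions); `inode_of` is a hypothetical helper.

```shell
# GNU sed assumed: -i and \dNNN are GNU extensions.
inode_of() { ls -i "$1" | awk '{ print $1 }'; }   # hypothetical helper

sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"                # "café" in UTF-8

before=$(inode_of "$sample")
LC_ALL=C sed -i 's/[\d128-\d255]//g' "$sample"
after=$(inode_of "$sample")

cat "$sample"                                     # prints: caf
if [ "$before" != "$after" ]; then
    echo "inode changed: sed replaced the file with a rewritten copy"
fi
rm -f "$sample"
```

A changed inode means hard links to the old file keep the old contents, and enough free disk space for a second copy is needed while sed runs.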
40

I tried all the solutions and nothing worked. The following, however, does:

tr -cd '\11\12\15\40-\176'

Which I found here:

https://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix

My problem needed it in a series of piped programs, not directly from a file, so modify as needed.
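
For readers decoding the octal set, here is a small sketch of the same filter used in a pipeline. The function name is hypothetical, and the `LC_ALL=C` prefix is an addition not present in the command above, made on the assumption that byte-wise handling is wanted regardless of the user's locale.

```shell
# Octal codes kept by the set:
#   \11 = tab, \12 = line feed, \15 = carriage return,
#   \40-\176 = space through tilde (the printable ASCII range).
# keep_printable_ascii is a hypothetical name; LC_ALL=C is an added assumption.
keep_printable_ascii() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# Prints "caf", a tab, then "ok": the two bytes of é and the \001 byte are dropped.
printf 'caf\303\251\tok\001\n' | keep_printable_ascii
```

Here `-c` complements the set and `-d` deletes, so everything not listed is removed.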

Katastic Voyage
18

Try tr instead of sed

tr -cd '[:print:]' < file.txt
Vivek
  • 6
    The OP specifically mentioned he didn't want to use tr (because he wanted an "in place" conversion, which sed -i pretends to be; it really writes to a temp file and renames behind the scenes). So this answer doesn't help the OP. BUT... for those who want to use tr, you might want to preserve newlines (the 20180228 version shown here does not). A simple tweak however preserves newlines and carriage returns: `tr -cd '[:print:]\n\r' < file.txt` – Juan Mar 07 '18 at 00:08
  • 1
    `tr -cd '[:print:]' – evandrix Aug 07 '19 at 21:48
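
Since tr only reads standard input and writes standard output, the result has to be written back explicitly. A minimal sketch, assuming a temporary file is acceptable; `strip_in_place` is a hypothetical helper name, and the `LC_ALL=C` prefix is an assumption that pins `[:print:]` to the ASCII printable range.

```shell
# strip_in_place is a hypothetical helper: filter a file with tr, then write it back.
strip_in_place() {
    tmp=$(mktemp) || return 1
    # Keep printable characters plus newline and carriage return,
    # as suggested in the comment above. Only overwrite if tr succeeded.
    LC_ALL=C tr -cd '[:print:]\n\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

sample=$(mktemp)
printf 'caf\303\251\001 ok\r\n' > "$sample"
strip_in_place "$sample"
cat "$sample"      # prints "caf ok" followed by CR LF
rm -f "$sample"
```

Copying the temporary file back with `cat` keeps the original file's inode and permissions, at the cost of a window during which the file is incomplete; renaming with `mv` makes the opposite trade.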
16
sed -i 's/[^[:print:]]//' FILENAME

Also, this acts like dos2unix

jcalfee314
  • 12
    Does not work. [:print:] is not the same as ASCII. There are many printable non-ASCII characters. – Jason C Jun 18 '14 at 15:17
  • 1
    Also the g modifier is missing. Only the first non-printable character would be removed. – proski Nov 30 '17 at 00:18
  • 1
    @JasonC There are also many non-printable ASCII characters. It's likely the original question was poorly formed. – Juan Mar 07 '18 at 01:21
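
The effect of the missing g modifier, and the dos2unix-like behaviour, can both be seen on a single line of input. This sketch uses `cat -v` only to make control characters visible; the `LC_ALL=C` prefix is an assumption so that `[:print:]` means printable ASCII.

```shell
# Without g: only the first non-printable character on each line is removed.
printf 'a\001b\002c\r\n' | LC_ALL=C sed 's/[^[:print:]]//'  | cat -v   # prints: ab^Bc^M

# With g: every non-printable character goes, including the trailing CR,
# which is why the command behaves like dos2unix.
printf 'a\001b\002c\r\n' | LC_ALL=C sed 's/[^[:print:]]//g' | cat -v   # prints: abc
```

In a UTF-8 locale `[:print:]` also covers printable non-ASCII characters, which is the objection raised in the first comment.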
8
# -i (inplace)

LANG=C sed -i -E "s|[\d128-\d255]||g" /path/to/file(s)

The role of the LANG=C part is to avoid an "Invalid collation character" error.

Based on Ivan's answer and Patrick's comment.

evandrix
Nicolas Raoul
6

I'm using a very minimal busybox system, in which there is no support for ranges in tr or for POSIX character classes, so I have to do it the crappy old-fashioned way. Here's the solution with sed, stripping ALL non-printable and non-ASCII characters from the file:

sed -i 's/[^a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILE
ACK_stoverflow
  • 1
    I don't have your system to test it on, but considering space " " is character 32 (decimal) and tilde "~" is character 126, all of the printable ASCII characters fall between these. If your sed supports [a-z] type ranges, and [^ type "not in" syntax, you should be able to replace that long string of characters with: `sed -i 's/[^ -~]//g' FILE` (that's /[^ -~]/, with a space after the caret) – JohnGH Nov 25 '20 at 15:46
  • 1
    @JohnGH Excellent, this does indeed work! A much better solution, albeit six years down the road :) – ACK_stoverflow Nov 25 '20 at 16:56
  • 1
    Sorry for the laggy response ;-) – JohnGH Dec 03 '20 at 15:49
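
A quick check of the range form on a stream, as a sketch only: it was not run on busybox, and the `LC_ALL=C` prefix is an assumption added so the range is compared byte by byte.

```shell
# Keep only bytes from space (32) through tilde (126) on each line.
# Tabs fall outside that range, so they are removed as well.
printf 'caf\303\251\tok~\n' | LC_ALL=C sed 's/[^ -~]//g'    # prints: cafok~
```

If tabs should survive, a literal tab character can be added inside the brackets.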
6

This worked for me:

sed -i 's/[^[:print:]]//g'
Jorge Y. C. Rodriguez
AJn
3
awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' MYinputfile.txt > pipe_out_to_CONVERTED_FILE.txt
guestSA
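
One thing to keep in mind with the awk approach: `sub` replaces only the first match on each line, while `gsub` replaces all of them. The following sketch swaps in `gsub` and a simpler space-to-tilde range, so it is a variation on the answer above rather than the same command; the `LC_ALL=C` prefix is an assumption to force byte-wise matching.

```shell
# gsub removes every byte outside space..tilde on each line.
printf 'caf\303\251 [ok]\001\n' | LC_ALL=C awk '{ gsub(/[^ -~]/, ""); print }'   # prints: caf [ok]
```

As with tr, awk writes to standard output, so the result still needs to be redirected to a new file and moved over the original.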
3

As an alternative to sed or perl, you may consider using ed(1) and POSIX character classes.

Note: ed(1) reads the entire file into memory to edit it in-place, so for really large files you should use sed -i ..., perl -i ...

# see:
# - http://wiki.bash-hackers.org/doku.php?id=howto:edit-ed
# - http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes

# test
echo $'aaa \177 bbb \200 \214 ccc \254 ddd\r\n' > testfile
ed -s testfile <<< $',l' 
ed -s testfile <<< $'H\ng/[^[:graph:][:space:][:cntrl:]]/s///g\nwq'
ed -s testfile <<< $',l'
trevor
0

I appreciate the tips I found on this site.

But, on my Windows 10, I had to use double quotes for this to work ...

sed -i "s/[\d128-\d255]//g" FILENAME

Noticed these things ...

  1. For FILENAME, the entire path\name needs to be quoted. This didn't work: %TEMP%\"FILENAME". This did: "%TEMP%\FILENAME".

  2. sed leaves behind temp files in the current directory, named sed*

Renats Stozkovs
Larry8811
  • Note: this answer works with gnu sed, but is not portable to other versions of sed (e.g., bsd). Given the side effects mentioned in this answer, it seems like a weird windows compiled version that tries to emulate gnu sed. Or the user is muddying the water with unrelated shell issues. – Juan Mar 07 '18 at 01:30