search document for non-ascii

Question

An application on my computer needs to read in a text file. I have several, and one doesn't work; the program fails to read it and tells me that there is a bad character in it somewhere. My first guess is that there's a non-ascii character in there somewhere, but I have no idea how to find it. Perl or any generic regex would be nice. Any ideas?

I believe you can find an answer [here](http://stackoverflow.com/questions/881931/how-can-i-find-extended-ascii-characters-in-a-file-using-perl). – Neilos, Jan 13 '12 at 03:11
I accepted mathematical.coffee's answer because it was super easy and it worked in Notepad++. – Nate Glenn, Jan 13 '12 at 19:24

score 12 · Accepted Answer · answered Jan 13 '12 at 03:06

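You can use `[^\x20-\x7E]` to match any character outside the printable ASCII range (space through `~`), which covers non-ASCII characters as well as control characters.

e.g. `grep -P '[^\x20-\x7E]' suspicious_file`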
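To see what this class actually flags, here is a minimal Python sketch. Python is an illustrative choice rather than part of the original answer, and `find_matches` and the sample string are hypothetical. It applies the same character class to raw bytes, so a UTF-8 encoded "é" shows up as two bytes and a tab is reported too.

```python
import re

# Same character class as above, compiled against bytes so that every
# byte of the input is tested individually.
NOT_PRINTABLE_ASCII = re.compile(rb'[^\x20-\x7E]')


def find_matches(data):
    """Return (offset, byte value) for every byte the class matches."""
    return [(m.start(), m.group()[0]) for m in NOT_PRINTABLE_ASCII.finditer(data)]


# Hypothetical sample: "café" encoded as UTF-8, then a tab and "ok".
sample = "café\tok".encode("utf-8")
for offset, value in find_matches(sample):
    print(f"offset {offset}: byte {value:02X}")
# offset 3: byte C3
# offset 4: byte A9
# offset 5: byte 09
```

The tab at offset 5 is why whitespace usually needs to be added to the class when whole files are scanned.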

mathematical.coffee

I had a problem using this, as it would also identify all of the end-of-line characters in my file. Combining your answer with Ruakh's worked like a charm, though: `[^\t\n\r\x20-\x7E]` – JMM Nov 07 '13 at 15:57
In my case, the [answer from the other question](http://stackoverflow.com/a/882437/873282) was better: `[\xE0-\xFF]` – koppor Jan 02 '16 at 13:50

score 4 · Answer 2 · answered Jan 13 '12 at 03:07

perl -wne 'printf "byte %02X in line $.\n", ord $& while s/[^\t\n\x20-\x7E]//;'

will find every byte that is not a printable ASCII character, tab, space, or newline.

If it reports 0Ds (carriage-returns) in files that are O.K., then change \t\n to \t\n\r.

If it only reports 0Ds in files that are bad, then you can probably fix those files by running dos2unix on them.
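For readers without Perl at hand, the same report can be sketched in Python. This is a rough equivalent under my own assumptions, not a translation the answerer provided: `report_bad_bytes` and its `allow_cr` switch are hypothetical names, with `allow_cr=True` standing in for the `\t\n\r` variant mentioned above.

```python
import re


def report_bad_bytes(path, allow_cr=False):
    """Return (line number, byte value) pairs for disallowed bytes in path.

    Allowed bytes: printable ASCII, tab, newline, and (optionally)
    carriage return.
    """
    allowed = rb'\t\n\x20-\x7E' + (rb'\r' if allow_cr else rb'')
    bad = re.compile(rb'[^' + allowed + rb']')
    hits = []
    with open(path, 'rb') as handle:  # binary mode: inspect raw bytes
        for line_no, line in enumerate(handle, start=1):
            hits.extend((line_no, m.group()[0]) for m in bad.finditer(line))
    return hits


def format_report(hits):
    """Format hits in the same style as the Perl one-liner's output."""
    return [f"byte {value:02X} in line {line_no}" for line_no, value in hits]
```

Calling `format_report(report_bad_bytes('suspicious_file'))` gives one message per offending byte. If every message is a `0D`, the file most likely just has DOS line endings.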

ruakh

Just an addendum: the input file should be passed as the final argument, which is not shown above. – josh.chavanne Feb 19 '14 at 22:03
Like that, thank you! I had to change it slightly for a DOS console: `perl -wne "printf qq(byte %02X in line $.\n), ord $& while s/[^\t\n\x20-\x7E]//;"` – rplantiko May 19 '14 at 12:16

score 2 · Answer 3 · answered Apr 12 '16 at 13:02

If you use tabs in your source code as well, try this pattern:

[^\x08-\x7E]

It also works in Notepad++.
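One caveat worth checking before relying on this range: it starts at `\x08` (backspace), and the tab is `\x09`, so every control character from `\x08` to `\x1F` is let through, not only the tab. The Python sketch below (my own illustration, with a hypothetical `flagged` helper and sample) compares it with a class that lists the whitespace characters explicitly.

```python
import re

WIDE_RANGE = re.compile(rb'[^\x08-\x7E]')      # pattern from this answer
EXPLICIT = re.compile(rb'[^\t\n\r\x20-\x7E]')  # whitespace listed explicitly


def flagged(pattern, data):
    """Return the byte values that the pattern reports in data."""
    return [m.group()[0] for m in pattern.finditer(data)]


# Hypothetical sample: a tab, an escape byte (0x1B), and a Latin-1 "é" (0xE9).
sample = b'a\tb\x1bc\xe9'

print([f"{b:02X}" for b in flagged(WIDE_RANGE, sample)])  # ['E9']
print([f"{b:02X}" for b in flagged(EXPLICIT, sample)])    # ['1B', 'E9']
```

Both classes accept the tab, but only the explicit one reports the stray escape byte.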

elwood
