13

An application on my computer needs to read in a text file. I have several, and one doesn't work; the program fails to read it and tells me that there is a bad character in it somewhere. My first guess is that there's a non-ASCII character in there, but I have no idea how to find it. A Perl one-liner or any generic regex would be nice. Any ideas?

Nate Glenn

3 Answers

12

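You can use [^\x20-\x7E] to match any character outside the printable ASCII range (space through tilde). Note that this also matches control characters such as tab and carriage return.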

e.g. grep -P '[^\x20-\x7E]' suspicious_file
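If `grep -P` isn't available, the same check can be sketched in Python. This is only a rough equivalent, not what grep does internally: the function name `suspicious_lines` and the sample bytes are made up for illustration. It splits on LF the way grep does, so a tab or a trailing carriage return inside a line is flagged too.

```python
import re

# Bytes outside the printable ASCII range 0x20-0x7E (same class as the grep example).
NON_PRINTABLE = re.compile(rb"[^\x20-\x7E]")

def suspicious_lines(data: bytes):
    """Return (line_number, line) pairs for lines containing a byte outside 0x20-0x7E."""
    hits = []
    # Split on LF only; a tab or trailing CR stays in the line and is flagged.
    for number, line in enumerate(data.split(b"\n"), start=1):
        if NON_PRINTABLE.search(line):
            hits.append((number, line))
    return hits

if __name__ == "__main__":
    # Hypothetical sample: line 2 holds a UTF-8 "é", line 3 holds a tab.
    sample = b"plain text\ncaf\xc3\xa9\ntab\there\n"
    for number, line in suspicious_lines(sample):
        print(number, line)
```

For a real file, read it in binary mode (for example `open("suspicious_file", "rb").read()`) and pass the bytes in.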

mathematical.coffee
  • I had a problem using this, as it would also identify all of the end-of-line characters in my file. Combining your answer with ruakh's, though, worked like a charm: `[^\t\n\r\x20-\x7E]` – JMM Nov 07 '13 at 15:57
  • In my case, the [answer from the other question](http://stackoverflow.com/a/882437/873282) was better: `[\xE0-\xFF]` – koppor Jan 02 '16 at 13:50
4
perl -wne 'printf "byte %02X in line $.\n", ord $& while s/[^\t\n\x20-\x7E]//;'

will find every byte that is not a printable ASCII character, tab, space, or newline, and report its hex value and line number.

If it reports 0Ds (carriage returns) in files that are otherwise O.K., then change \t\n to \t\n\r in the pattern.

If it only reports 0Ds in files that are bad, then you can probably fix those files by running dos2unix on them.
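For anyone without Perl at hand, here is a rough Python sketch of the same idea. It is not a line-for-line port: the function name `bad_bytes` and the `allow_cr` switch are my own, with `allow_cr=True` standing in for the `\t\n\r` variant mentioned above. Lines are counted by LF, as Perl's `$.` does by default.

```python
import re

def bad_bytes(data: bytes, allow_cr: bool = False):
    """Return (line_number, byte_value) for each byte outside tab, LF and 0x20-0x7E.

    With allow_cr=True, carriage returns (0x0D) are accepted as well.
    """
    allowed = rb"\t\n\r" if allow_cr else rb"\t\n"
    pattern = re.compile(rb"[^" + allowed + rb"\x20-\x7E]")
    reports = []
    # Count lines by LF so a CR in a CRLF ending stays visible inside its line.
    for number, line in enumerate(data.split(b"\n"), start=1):
        for match in pattern.finditer(line):
            reports.append((number, match.group()[0]))
    return reports

if __name__ == "__main__":
    # Hypothetical sample: a CRLF line ending, then a Latin-1 byte 0xEF.
    sample = b"ok\r\nna\xefve\n"
    for number, value in bad_bytes(sample):
        print(f"byte {value:02X} in line {number}")
```

To check an actual file, read it in binary mode and pass the contents to `bad_bytes`.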

ruakh
  • Just an addendum: the input file should be passed as the final argument, which is not shown in the command above. – josh.chavanne Feb 19 '14 at 22:03
  • Like that, thank you! I had to change it slightly for a DOS console: `perl -wne "printf qq(byte %02X in line $.\n), ord $& while s/[^\t\n\x20-\x7E]//;"` – rplantiko May 19 '14 at 12:16
2

If your source code also contains tabs, try this pattern:

[^\x08-\x7E]

It also works in Notepad++.
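One thing to be aware of: the range \x08-\x7E starts at backspace, so it accepts every control character from 0x08 through 0x1F, not just tab. The Python sketch below enumerates which byte values each class flags. The names `TAB_TOLERANT`, `STRICT`, and `flagged_bytes` are made up for illustration, and `STRICT` is the class from the comment on the first answer.

```python
import re

# The pattern from the answer above, applied to bytes.
TAB_TOLERANT = re.compile(rb"[^\x08-\x7E]")
# Stricter class that accepts only tab, LF, CR and printable ASCII.
STRICT = re.compile(rb"[^\t\n\r\x20-\x7E]")

def flagged_bytes(pattern):
    """Return every byte value (0-255) that the given pattern matches."""
    return [value for value in range(256) if pattern.search(bytes([value]))]

if __name__ == "__main__":
    loose = set(flagged_bytes(TAB_TOLERANT))
    strict = set(flagged_bytes(STRICT))
    # Bytes the stricter class reports but the tab-tolerant one lets through.
    print([f"{value:02X}" for value in sorted(strict - loose)])
```

If the offending character could be a stray control byte such as ESC (0x1B), the stricter class is probably the safer choice.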

elwood