An application on my computer needs to read in a text file. I have several such files, and one of them doesn't work: the program fails to read it and tells me there is a bad character in it somewhere. My first guess is that the file contains a non-ASCII character, but I have no idea how to find it. A Perl solution or any generic regex would be nice. Any ideas?
13
-
What did you try so far? – nmagerko Jan 13 '12 at 02:59
-
I believe you can find an answer [here](http://stackoverflow.com/questions/881931/how-can-i-find-extended-ascii-characters-in-a-file-using-perl)? – Neilos Jan 13 '12 at 03:11
-
I accepted mathematical.coffee's answer because it was super easy and worked in Notepad++. – Nate Glenn Jan 13 '12 at 19:24
3 Answers
12
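You can use `[^\x20-\x7E]` to match any character outside the printable ASCII range. Note that this also flags control characters such as tabs and carriage returns.
e.g. `grep -P '[^\x20-\x7E]' suspicious_file`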

mathematical.coffee
-
I had a problem using this, as it also matched all of the end-of-line characters in my file. Combining your answer with ruakh's worked like a charm, though: `[^\t\n\r\x20-\x7E]` – JMM Nov 07 '13 at 15:57
-
In my case, the [answer from the other question](http://stackoverflow.com/a/882437/873282) was better: `[\xE0-\xFF]` – koppor Jan 02 '16 at 13:50
4
perl -wne 'printf "byte %02X in line $.\n", ord $& while s/[^\t\n\x20-\x7E]//;'
will find every character that is not an ASCII glyphic character, tab, space, or newline.
If it reports `0D`s (carriage returns) in files that are O.K., then change `\t\n` to `\t\n\r`.
If it only reports `0D`s in files that are bad, then you can probably fix those files by running `dos2unix` on them.
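
For anyone without Perl at hand, here is a rough Python sketch of the same idea. It is only an illustration, not part of the one-liner above: the function name `find_suspect_bytes` and the `allow_cr` switch are made up for this example, and it additionally reports the column of each hit.

```python
import re

# Bytes that are NOT tab, newline, or printable ASCII (0x20-0x7E).
SUSPECT = re.compile(rb"[^\t\n\x20-\x7E]")
# Same class, but carriage returns (0x0D) are tolerated as well.
SUSPECT_ALLOW_CR = re.compile(rb"[^\t\n\r\x20-\x7E]")


def find_suspect_bytes(data: bytes, allow_cr: bool = False):
    """Return (line_number, column, byte_value) for every suspect byte.

    Line numbers and columns are 1-based; columns count bytes, not characters.
    """
    pattern = SUSPECT_ALLOW_CR if allow_cr else SUSPECT
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for match in pattern.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()[0]))
    return hits


# Small in-memory sample: a UTF-8 "é" and a CRLF line ending on line 2.
sample = b"plain line\ncaf\xc3\xa9 here\r\nlast\n"
for line_no, col, value in find_suspect_bytes(sample):
    print(f"byte {value:02X} in line {line_no}, column {col}")
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the function; pass `allow_cr=True` if the file legitimately uses Windows line endings.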

ruakh
-
Just an addendum: the input file has to be passed as the final argument, which is not shown in the command above. – josh.chavanne Feb 19 '14 at 22:03
-
Like that, thank you! I had to change it slightly for a DOS console: `perl -wne "printf qq(byte %02X in line $.\n), ord $& while s/[^\t\n\x20-\x7E]//;"` – rplantiko May 19 '14 at 12:16
2
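If you use tabs in your source code as well, try this pattern: `[^\x08-\x7E]`
Note that the range starts at `\x08` (backspace), so it lets through every control character from `\x08` to `\x1F`, not just the tab (`\x09`).
It also works in Notepad++.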
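
To see how the patterns mentioned in this thread differ, a quick comparison can help. The following is a small Python sketch under the assumption that the file is checked byte by byte; the helper name `flagged_by` is hypothetical and exists only for this illustration.

```python
import re

# The patterns quoted in this thread, compiled for byte strings.
PATTERNS = {
    r"[^\x20-\x7E]": re.compile(rb"[^\x20-\x7E]"),
    r"[^\t\n\r\x20-\x7E]": re.compile(rb"[^\t\n\r\x20-\x7E]"),
    r"[^\x08-\x7E]": re.compile(rb"[^\x08-\x7E]"),
}


def flagged_by(byte_value: int):
    """Return the patterns (as written in the thread) that flag this byte."""
    probe = bytes([byte_value])
    return [name for name, rx in PATTERNS.items() if rx.search(probe)]


# Tab, carriage return, escape, the letter "A", and a Latin-1 "é".
for value in (0x09, 0x0D, 0x1B, 0x41, 0xE9):
    print(f"{value:02X}: {flagged_by(value)}")
```

The escape byte `0x1B` is the interesting case: it falls inside `\x08-\x7E`, so the last pattern does not report it, while the other two do.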

elwood