425

I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:

grep -e "[\x{00FF}-\x{FFFF}]" file.xml

But this returns every line in the file, regardless of whether the line contains a character in the range specified.

Do I have the syntax wrong or am I doing something else wrong? I've also tried:

egrep "[\x{00FF}-\x{FFFF}]" file.xml 

(with both single and double quotes surrounding the pattern).

pconrey
  • 5,805
  • 7
  • 29
  • 38
  • ASCII characters are only one byte long, so unless the file is unicode there should be no characters above 0xFF. – zdav Jun 08 '10 at 20:53
  • How do we go above \xFF? Grep gives a "grep: range out of order in character class" error. – Mudit Jain Dec 08 '14 at 19:16
  • 1
    Sometimes it's nice to have a second opinion about chars with the high bit set in a file. In that case, I like `tr -d '\0-\177' < foo.txt > foo.out && ls -al foo.out` to get a count. And/or followed by `od -x foo.out` to get a look at actual values. – Ron Burk Aug 26 '21 at 05:54
  • The [awk solution](https://stackoverflow.com/a/69498200/41906) and [C locale + grep](https://stackoverflow.com/a/3208902/41906) work on BSD. – Clint Pachl May 05 '22 at 00:58

16 Answers

581

You can use the command:

grep --color='auto' -P -n "[\x80-\xFF]" file.xml

This will give you the line number, and will highlight non-ASCII characters in red.

On some systems, depending on your settings, the above will not work, so you can grep by the inverse:

grep --color='auto' -P -n "[^\x00-\x7F]" file.xml

Note also that the important bit is the -P flag, which equates to --perl-regexp: it makes grep interpret your pattern as a Perl regular expression. The man page also notes that

this is highly experimental and grep -P may warn of unimplemented features.

jerrymouse
  • 16,964
  • 16
  • 76
  • 97
  • 54
    This won't work in BSD `grep` (on OS X 10.8 Mountain Lion), as it does not support the `P` option. – Bastiaan M. van de Weerd Oct 22 '12 at 09:54
  • 21
    To update my last comment, the GNU version of `grep` is available in Homebrew's `dupes` library (enable using `brew tap homebrew/dupes`): `brew install grep` – Bastiaan M. van de Weerd Oct 22 '12 at 10:03
  • 55
    @BastiaanVanDeWeerd is correct, grep on OSX 10.8 no longer supports PCRE ("Perl-compatible regular expressions") as Darwin now uses BSD grep instead of GNU grep. An alternative to installing the `dupes` library is to install `pcre` instead: `brew install pcre`... as part of this, you will get the `pcregrep` utility, which you can use as follows: `pcregrep --color='auto' -n "[\x80-\xFF]" file.xml` – pvandenberk Dec 04 '12 at 11:24
  • Keep in mind that using the `-n` flag means you will incur a performance drop. Usually not a big deal, but it is if you're working on very large files (as you had mentioned). – Vinay Jul 31 '13 at 00:10
  • You can also use grep -f pattern.txt. Put your pattern (not unicode-coded chars but simply cyrillic text) in pattern.txt. This works with cyrillic. – Developer Marius Žilėnas Oct 09 '13 at 06:02
  • this continues to be a helpful answer over a year later. I have an issue where I'm running jekyllrb triggered by incron rules: when I manually run the jekyll build command everything works fine, but for some reason when the jekyll build command is run by incron it ends up choking with an "ArgumentError: invalid byte sequence in US-ASCII" error. While I'm still trying to sort the encoding issue so that it's all being handled as UTF-8, this answer at least let me find the offending characters so that it would start working in the meantime. – Stephen Washburn Nov 29 '13 at 01:27
  • 19
    For Mac `brew` users, [GNU's coreutils](https://www.gnu.org/software/coreutils/) can be installed with `brew install coreutils`. This will give you lots of GNU tools prefixed with a 'g' - in this case use `ggrep`. This should avoid problems arising from replacing a system utility, since system-specific Mac scripts now depend on BSD grep. – Joel Purra Jun 24 '14 at 07:37
  • 26
    this works fine on a mac `ag "[\x80-\xFF]" file` you just need to install `the_silver_searcher` – slf Aug 07 '14 at 15:52
  • 1
    I've found that when most people say Non-ASCII, they mean non-printable. So, it would be better to use this: "[^\n -~]". Add \r for DOS text files. That translates to "[^\x0A\x20-\x7E]" and add \x0D for the \r. – Harvey Oct 03 '14 at 13:18
  • 2
    @JoelPurra coreutils doesn't seem to include grep. – mjs Nov 02 '14 at 17:06
  • 1
    If you don't have GNU grep, but do have Perl, something like this might work: `perl -ne 'chomp, printf("%s: %s\n", $ARGV, $_) if /[^\n -~]/' file.xml`. – mjs Nov 02 '14 at 17:07
  • @mjs: oops. It's `brew install homebrew/dupes/grep` or `brew install grep` if you already did `brew tap homebrew/dupes`. Had installed it through [Mathias Bynens' dotfiles](https://github.com/mathiasbynens/dotfiles), mistook it for a coreutils program. – Joel Purra Nov 03 '14 at 21:21
  • 3
    I'm having a problem with Hangul Korean: `echo '소녀시대' | grep -P "[\x80-\xFF]"` returns nothing for me -- can anyone else confirm? (GNU grep 2.21) – frabjous Jan 09 '15 at 02:39
  • This command won't include Chinese characters, and as @frabjous mentioned, Korean characters are excluded, too. pvandenberk's answer below can well solve this problem. – Zen Apr 16 '15 at 03:59
  • 7
    Prefix this command with `LC_ALL=C` ! On my system (where `LANG=en_US.UTF-8`), this command alone is unable to find in an UTF-8 file the curly apostrophe ’ (`Right single quotation mark` / `U+2019`) used very often by MS software, instead of the ASCII one – calandoa Apr 23 '15 at 14:29
  • To add to what frabjous and calandoa wrote, this also misses U+2013: EN DASH in a UTF-8 file and probably UTF-16 too (I didn't test UTF-16). It was useful for a good, first order search, though. – twm Dec 20 '15 at 16:58
  • You can often use "pcregrep" if your system grep doesn't contain the perl extension. Though, there's a good chance you won't have pcregrep either. But it's often easier to install a pcregrep that your vendor may have packaged than it would be to build your own grep. :) – dannysauer Feb 01 '16 at 18:35
  • How do I grep for U+2028? – kev Oct 20 '16 at 23:19
  • @kev Since U+2028 is encoded as `e2 80 a8` you can search it like this: `ag "\xe2\x80\xa8"`. Note that `grep -P "\xe2\x80\xa8"` did not work for me with GNU grep 2.27. – psmith Dec 20 '16 at 04:39
  • This seems to work for Unicode files but not for Iso-latin-1/Windows-1252 files (these are 8 bit character sets that have non-ascii characters like à © (accents, copyright) etc in positions 128-255). – ttulinsky Jan 05 '17 at 17:49
  • 1
    Note if the character is a non-breaking space (&nbsp) it will NOT show up in red because it is printed as a space. The command `LC_ALL=C grep '[^ -~]' file.xml` from Gilles below DOES show &nbsp as an illegal character (diamond with question mark), for both utf-8 and Iso-latin-1/Windows-1252 files – ttulinsky Jan 05 '17 at 18:22
  • And for those who want/need to process *lots* of files, you can combine the command with the `find` command; e.g. `find . -type f | xargs grep --color='auto' -P -n "[^\x00-\x7F]"` – code_dredd Nov 05 '17 at 03:52
  • Within GitBash on Windows, both `-n` and escape sequences seem to be not needed, as the **following worked fine** for me: `grep -P '\d+[А-Яа-я]+' AW.md` – Mike Makarov Feb 17 '18 at 12:52
  • Did not work for unicode `'`s on Linux. The [next solution](https://stackoverflow.com/a/13702856/1136400) worked. – Doncho Gunchev Apr 09 '20 at 09:11
  • "In some systems, depending on your settings" is very fuzzy and hand-wavy. I think it's about the fact that utf8 decoding happens before the grepping. And it means `[^\x00-\x7F]'` is the preferred solution while this answer only offers it as a fallback in unexplained circumstances, so I upvoted the answer from pvandenberk instead. – David Faure Jul 02 '21 at 09:24
154

Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters instead.

So the first solution for instance would become:

grep --color='auto' -P -n '[^\x00-\x7F]' file.xml

(which basically greps for any character outside of the hexadecimal ASCII range: from \x00 up to \x7F)

On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:

pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml

Any pros or cons that anyone can think of?

pvandenberk
  • 4,649
  • 2
  • 26
  • 14
  • 10
    This actually worked for me where the above solutions failed. Finding M$ Word apostrophes hasn't been easier! – AlbertEngelB Apr 27 '15 at 20:17
  • 5
    If you have a bash-compatible shell but not pcre-grep working, `LC_COLLATE=C grep $'[^\1-\177]'` works (for files without null bytes) – idupree Jun 03 '15 at 17:48
  • 2
    This solution seems to work more consistently than the ones above. – 0xcaff Jul 31 '15 at 16:26
  • 1
    I had to use this to pick up Kanji, Cyrillic and Traditional Chinese in my UTF8 file; using "[\x80-\xFF]" missed all of these. – buckaroo1177125 Aug 13 '15 at 04:59
  • 1
    The pro is this worked excellently while the other options were great but not as great. No cons found so far. – jwpfox Sep 19 '16 at 11:03
  • what does the -n do? – wide_eyed_pupil May 09 '18 at 17:09
  • `-n, --line-number` Each output line is preceded by its relative line number in the file, starting at line 1. The line number counter is reset for each file processed. This option is ignored if -c, -L, -l, or -q is specified. – harperville Feb 18 '20 at 05:01
  • Above solution did work for me for unicode char like `œ` – malat Jan 05 '21 at 08:54
71

The easy way is to define a non-ASCII character... as a character that is not an ASCII character.

LC_ALL=C grep '[^ -~]' file.xml

Add a tab after the ^ if necessary.

Setting LC_COLLATE=C avoids nasty surprises about the meaning of character ranges in many locales. Setting LC_CTYPE=C is necessary to match single-byte characters — otherwise the command would miss invalid byte sequences in the current encoding. Setting LC_ALL=C avoids locale-dependent effects altogether.
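For instance, to also treat a literal tab as printable (the variant suggested above), bash's $'…' quoting can embed the tab in the bracket expression; a minimal sketch:

LC_ALL=C grep $'[^\t -~]' file.xml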

Gilles 'SO- stop being evil'
  • 104,111
  • 38
  • 209
  • 254
  • 1
    On RedHat 6.4 with tcsh, I had to use <<< env LC_COLLATE=C grep -n '[^ -~]' file.xml >>>. I added -n to get the line number. – ddevienne Feb 06 '14 at 09:43
  • For me `echo "A" | LC_COLLATE=C grep '[^ -~]'` returns a match – frabjous Jan 09 '15 at 02:54
  • @frabjous It shouldn't. What are your other locale settings (output of `locale`)? On what platform? – Gilles 'SO- stop being evil' Jan 09 '15 at 11:11
  • `locale` gives `en_US.UTF-8` for all the relevant variables mentioned in the grep man page (LC_ALL, LANG, etc.). Testing a bit more, however, I see that `echo "A" | LC_ALL=C grep '[^ -~]'` works as expected, except that if then send Unicode characters into the pipe, they are garbled in the result. I'm on ArchLinux x86_64. – frabjous Jan 09 '15 at 15:07
  • 5
    @frabjous If you have `LC_ALL=en_US.UTF-8`, that trumps the `LC_COLLATE` setting. You shouldn't have this in your environment! `LC_ALL` is only to force a specific task to use a particular locale, usually `C`. To set the default locale for all categories, set `LANG`. – Gilles 'SO- stop being evil' Jan 09 '15 at 16:12
  • Thanks. I had `export LC_ALL=en_US.UTF-8` in my `.bashrc` for reasons I can't recall. – frabjous Jan 09 '15 at 21:38
  • For some reason, this command takes 30 times longer for me (15 seconds instead of 0.5 seconds) than the most upvoted ones, on a file with 1499863 lines (155 MB). – gerrit Dec 15 '15 at 12:35
  • @gerrit My crystal ball tells me that you use GNU grep in a multibyte locale. It can be very slow with some regexps. In any case, my answer was wrong (or at least incomplete): it would have missed invalid byte sequences in the ambient locale. Try again with `LC_ALL=C`. – Gilles 'SO- stop being evil' Dec 15 '15 at 13:04
  • @Gilles Right, now it's only 20% slower rather than a factor 30 :) – gerrit Dec 15 '15 at 13:51
  • 1
    At first, I didn't add `LC_ALL=C`, it behaves differently on Mac OS X and Ubuntu. After I add this setting, they give the same result. – Max Peng Jun 14 '16 at 07:23
  • Can this offer any extra benefit in British LaTeX documents? It returns many LaTeX syntax symbols but I am not sure if they are right/wrong LaTeX symbols. Thread here http://unix.stackexchange.com/q/326246/16920 – Léo Léopold Hertz 준영 Nov 27 '16 at 09:23
  • 2
    This works on a Mac, while the other grep-based solutions don't. – Matthias Fripp Oct 24 '17 at 21:38
  • Works on MAC. Verified – Pal Aug 18 '22 at 16:41
68

The following works for me:

grep -P "[\x80-\xFF]" file.xml

Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.
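If you also want to see where and what the matches are, GNU grep's -n (line number) and -o (print only the matched part) options combine with -P; a sketch:

grep -noP "[\x80-\xFF]" file.xml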

Thelema
  • 14,257
  • 6
  • 27
  • 35
  • 1
    For those who might not immediately know how to call this over multiple files, just run: find . -name '*.xml' | xargs grep -P "[\x80-\xFF]" – David Mohundro Nov 17 '10 at 03:30
  • 1
    This does return a match, but there is no indication of what the character is and where it is. How does one see what the character is, and where it is? – Faheem Mitha Oct 20 '11 at 06:25
  • Adding the "-n" will give the line number, additionally non-visible chars will show as a block at the terminal: grep -n -P "[\x80-\xFF]" file.xml – fooMonster Oct 20 '11 at 12:53
  • 6
    I'm having a problem with Hangul Korean: `echo '소녀시대' | grep -P "[\x80-\xFF]"` returns nothing for me -- can anyone else confirm? (GNU grep 2.21) – frabjous Jan 09 '15 at 02:40
  • @frabjous Same here, but grepping the inverse works: `echo '소녀시대' | grep -P "[^\x00-\x7F]"`. Or just use `the_silver_searcher` as pointed out by @slf: `echo '소녀시대' | ag "[\x80-\xFF]"` – psmith Dec 20 '16 at 04:30
62

In Perl:

perl -ne 'print if /[[:^ascii:]]/' fileName > newFile
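
To also report the line number of each match, a sketch using Perl's built-in $. variable (the current input line number):

perl -ne 'print "$.: $_" if /[[:^ascii:]]/' fileName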
noquery
  • 1,895
  • 1
  • 15
  • 16
30

Here is another variant I found that produced completely different results from the grep search for [\x80-\xFF] in the accepted answer. Perhaps it will be useful to someone for finding additional non-ASCII characters:

grep --color='auto' -P -n "[^[:ascii:]]" myfile.txt

Note: my Mac's grep did not have the -P option, so I did brew install grep and started the call above with ggrep instead of grep.

ryanm
  • 2,239
  • 21
  • 31
22

Searching for non-printable chars. TL;DR: Executive Summary

  1. search for control chars AND extended unicode
  2. locale setting e.g. LC_ALL=C needed to make grep do what you might expect with extended unicode

SO the preferred non-ascii char finders:

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test

as in top answer, the inverse grep:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

as in top answer but WITH LC_ALL=C:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test

. . more . . excruciating detail on this: . . .

I agree with Harvey above, buried in the comments: it is often more useful to search for non-printable characters, OR it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests "use this: "[^\n -~]". Add \r for DOS text files. That translates to "[^\x0A\x20-\x7E]" and add \x0D for CR".

Also, adding -c (show count of patterns matched) to grep is useful when searching for non-printable chars as the strings matched can mess up terminal.

I found adding range 0-8 and 0x0e-0x1f (to the 0x80-0xff range) is a useful pattern. This excludes the TAB, CR and LF and one or two more uncommon printable chars. So IMHO quite a useful (albeit crude) grep pattern is THIS one:

grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

ACTUALLY, generally you will need to do this:

LC_ALL=C grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

breakdown:

LC_ALL=C - set locale to C, otherwise many extended chars will not match (even though they look like they are encoded > 0x80)
\x00-\x08 - non-printable control chars 0 - 8 decimal
\x0E-\x1F - more non-printable control chars 14 - 31 decimal
\x80-\xFF - extended (non-ASCII) bytes 128 - 255 decimal
-c - print count of matching lines instead of lines
-P - perl style regexps

Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches

E.g. practical example of use find to grep all files under current directory:

LC_ALL=C find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} + 

You may wish to adjust the grep at times, e.g. the BS (0x08, backspace) char is used in some printable files, or you may want to exclude the VT (0x0B, vertical tab). The BEL (0x07) and ESC (0x1B) chars can also be deemed printable in some cases.
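For example, a sketch of the pattern loosened to tolerate BEL (0x07) and BS (0x08), per the note above:

LC_ALL=C grep -c -P -n "[\x00-\x06\x0E-\x1F\x80-\xFF]" *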

Non-Printable ASCII Chars
** marks PRINTABLE but CONTROL chars that are useful to exclude sometimes
Dec   Hex Ctrl Char description           Dec Hex Ctrl Char description
0     00  ^@  NULL                        16  10  ^P  DATA LINK ESCAPE (DLE)
1     01  ^A  START OF HEADING (SOH)      17  11  ^Q  DEVICE CONTROL 1 (DC1)
2     02  ^B  START OF TEXT (STX)         18  12  ^R  DEVICE CONTROL 2 (DC2)
3     03  ^C  END OF TEXT (ETX)           19  13  ^S  DEVICE CONTROL 3 (DC3)
4     04  ^D  END OF TRANSMISSION (EOT)   20  14  ^T  DEVICE CONTROL 4 (DC4)
5     05  ^E  END OF QUERY (ENQ)          21  15  ^U  NEGATIVE ACKNOWLEDGEMENT (NAK)
6     06  ^F  ACKNOWLEDGE (ACK)           22  16  ^V  SYNCHRONIZE (SYN)
7     07  ^G  BEEP (BEL)                  23  17  ^W  END OF TRANSMISSION BLOCK (ETB)
8     08  ^H  BACKSPACE (BS)**            24  18  ^X  CANCEL (CAN)
9     09  ^I  HORIZONTAL TAB (HT)**       25  19  ^Y  END OF MEDIUM (EM)
10    0A  ^J  LINE FEED (LF)**            26  1A  ^Z  SUBSTITUTE (SUB)
11    0B  ^K  VERTICAL TAB (VT)**         27  1B  ^[  ESCAPE (ESC)
12    0C  ^L  FF (FORM FEED)**            28  1C  ^\  FILE SEPARATOR (FS) RIGHT ARROW
13    0D  ^M  CR (CARRIAGE RETURN)**      29  1D  ^]  GROUP SEPARATOR (GS) LEFT ARROW
14    0E  ^N  SO (SHIFT OUT)              30  1E  ^^  RECORD SEPARATOR (RS) UP ARROW
15    0F  ^O  SI (SHIFT IN)               31  1F  ^_  UNIT SEPARATOR (US) DOWN ARROW

UPDATE: I had to revisit this recently. And, YMMV depending on terminal settings/solar weather forecast BUT . . I noticed that grep was not finding many unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xff, 3 and 4 byte unicode characters were not matched. ??? Can anyone explain this? YES. @frabjous asked and @calandoa explained that LC_ALL=C should be used to set the locale for the command to make grep match.

e.g. my locale has LC_ALL empty:

$ locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
.
.
LC_ALL=

grep with LC_ALL empty matches 2-byte encoded chars but not 3- and 4-byte encoded ones:

$ grep -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" notes_unicode_emoji_test
5:© copyright c2a9
7:call  underscore c2a0
9:CTRL
31:5 © copyright
32:7 call  underscore

grep with LC_ALL=C does seem to match all extended characters that you would want:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test  
1:���� unicode dashes e28090
3:��� Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5:� copyright c2a9
7:call� underscore c2a0
11:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29:1 ���� unicode dashes
30:3 ��� Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31:5 � copyright
32:7 call� underscore
33:11 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other
34:52 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other
81:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other

THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep on the top answer DO seem to find ALL the ~weird~ and ~wonderful~ "non-ascii" characters without setting locale:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test  

1 ‐‐ unicode dashes e28090
3  Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5 © copyright c2a9
7 call  underscore c2a0
9 CTRL-H CHARS URK URK URK 
11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ  YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29 1 ‐‐ unicode dashes
30 3  Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31 5 © copyright
32 7 call  underscore
33 11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ  YEOW, mix of japanese and chars from other
34 52 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ  YEOW, mix of japanese and chars from other
73 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ  YEOW, mix of japanese and chars from other

SO the preferred non-ascii char finders:

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test

as in top answer, the inverse grep:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

as in top answer but WITH LC_ALL=C:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
gaoithe
  • 4,218
  • 3
  • 30
  • 38
  • 1
    Answer to why grep doesn't match characters encoded in more than 2 bytes thanks to @calandoa and frabjous in comments above on question. Use LC_ALL=C before the grep command. – gaoithe Aug 23 '19 at 11:12
  • 1
    Thanks so much for bothering to post an answer buried under 800 other upvotes! My problem was a 0x02 character. You may want to put that "practical example of use" near the top, since you really don't need to read the whole post to just see if that's your problem. – Noumenon Sep 11 '19 at 22:33
  • 1
    I know, really old answer, and excruciating detail, but correct and useful for me and others too, I hope. You are right, I added the TL;DR at top. – gaoithe Sep 13 '19 at 12:33
9

The following code works:

find /tmp | perl -ne 'print if /[^[:ascii:]]/'

Replace /tmp with the name of the directory you want to search through.
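Note that this pipeline scans the file names printed by find, not the file contents. A sketch that checks contents instead, using Perl's standard $ARGV and $. variables (close ARGV resets the line counter per file):

find /tmp -type f -exec perl -ne 'print "$ARGV:$.: $_" if /[^[:ascii:]]/; close ARGV if eof' {} +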

user7417071
  • 91
  • 1
  • 1
2

This method should work with any POSIX-compliant version of awk and iconv. We can take advantage of file and tr as well.

curl is not POSIX, of course.

Solutions above may be better in some cases, but they seem to depend on GNU/Linux implementations or additional tools.

Just get a sample file somehow:

$ curl -LOs http://gutenberg.org/files/84/84-0.txt

$ file 84-0.txt

84-0.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

Search for UTF-8 characters:

$ awk '/[\x80-\xFF]/ { print }' 84-0.txt

or non-ASCII (not POSIX after all, see possible solution below)

$ awk '/[^[:ascii:]]/ { print }' 84-0.txt

Convert UTF-8 to ASCII, removing problematic characters (including BOM which should not be in UTF-8 anyway):

$ iconv -c -t ASCII 84-0.txt > 84-ascii.txt

Check it:

$ file 84-ascii.txt

84-ascii.txt: ASCII text, with CRLF line terminators

Tweak it to remove DOS line endings / ^M ("CRLF line terminators"):

$ tr -d '\015' < 84-ascii.txt > 84-tweaked.txt && file 84-tweaked.txt

84-tweaked.txt: ASCII text

This method discards any "bad" characters it cannot deal with, so you may need to sanitize / validate the output. YMMV

>> UPDATE << I have been using something closer to this lately:

$ LC_ALL=C tr -d '[:print:]' < 84-0.txt | fold -w 1 | sort -u | sed -n l

I am not sure how portable it is, but it gives me the option to automate swapping out characters or strings.

I do not have quick access to a real UNIX right now, but I think those are all POSIX-compliant options and switches. I do know it is pretty fast. YMMV.
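A variant of the same idea, in case you want a count of how often each non-printable character occurs (sort | uniq -c in place of sort -u; still standard tools):

LC_ALL=C tr -d '[:print:]' < 84-0.txt | fold -w 1 | sort | uniq -c | sed -n l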

Kajukenbo
  • 109
  • 1
  • 4
1

Strangely, I had to do this today! I ended up using Perl because I couldn't get grep/egrep to work (even in -P mode). Something like:

cat blah | perl -ne '/\xCA\xFE\xBA\xBE/ && print "found"'

For unicode characters (like \u2212 in example below) use this:

find . ... -exec perl -CA -e '$ARGV = $ARGV[0]; open IN, $ARGV; binmode(IN, ":utf8"); binmode(STDOUT, ":utf8"); while (<IN>) { next unless /\N{U+2212}/; print "$ARGV: $&: $_"; exit }' '{}' \;
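
A shorter sketch of the same search, assuming a reasonably recent perl: -CSD applies UTF-8 to the standard and default I/O streams, and \x{2212} names the code point directly (file.xml is a placeholder):

perl -CSD -ne 'print "$ARGV:$.: $_" if /\x{2212}/' file.xml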
dty
  • 18,795
  • 6
  • 56
  • 82
  • In this scenario you probably need to check the locales as mentioned in https://stackoverflow.com/a/3208902/7809404 – user8162 Jan 21 '21 at 09:13
1

It could be interesting to know how to search for a single Unicode character. This command can help; you only need to know its code point, which bash's $'\u…' quoting expands to the character's UTF-8 bytes:

grep -v $'\u200d'
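
Note that -v inverts the match, printing the lines that do not contain the character; to find the lines that do contain it, drop the -v. A sketch with line numbers (file.xml is a placeholder):

grep -n $'\u200d' file.xml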
arezae
  • 51
  • 2
  • I'm not really an expert, but I know enough to know that's not a UTF8 representation, it's UTF16, or maybe UTF32, or UCS16. For a 2-byte codepoint those three might all be the same. – Baxissimo Apr 11 '18 at 18:01
  • @Baxissimo : then clearly you don't know enough. That indeed generates a 3-byte `UTF-8` compliant sequence of `\342\200\215`. That also cannot be simultaneously `UTF-16` and `UTF-32`, since `UTF-32` requires `NUL` byte padding for all code points, but one can indeed locate the `UTF-16` byte-sequence within the `UTF-32` one as a prefix substring (for `little endian`) or as a suffix substring (for `big endian`) – RARE Kpop Manifesto Mar 23 '23 at 17:27
1

Finding all non-ASCII characters gives the impression that one is either looking for Unicode strings or intends to strip said characters individually.

For the former, try one of these (variable file is used for automation):

file=file.txt ; LC_ALL=C grep -Piao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8

file=file.txt ; pcregrep -iao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8

file=file.txt ; pcregrep -iao '[^\x00-\x19\x21-\x7F]{7,}' $file | iconv -f $(uchardet $file) -t utf-8

Vanilla grep doesn't work correctly without LC_ALL=C as noted in the previous answers.

The ASCII range is x00-x7F and space is x20; since strings contain spaces, the negated range omits it.

The non-ASCII range is x80-xFF; since strings contain spaces, the positive range adds it.

A string is presumed to be at least 7 consecutive characters within the range: {7,}.

For shell-readable output, uchardet $file returns a guess of the file encoding, which is passed to iconv for conversion to UTF-8.

noabody
  • 179
  • 4
0

This works for me.

Notes:

  • Without LC_ALL=C in front, it does not work on Ubuntu 22.04.
  • Because in my case lines are huge, I want to see only the non-ASCII characters plus the line and absolute offset at which each match occurred.
  • -o: print only the match, -b: byte offset of the match, -n: line number of the match, -P: perl-regexp

grep non-ascii characters:

Command:

LC_ALL=C grep --color='auto' -obnP  "[\x80-\xFF]" file.xml

Output:

868:31879:�
868:106287:�
868:106934:�
868:107349:�
868:254456:�
868:254678:�
868:286403:�
870:315585:�
870:389741:�
870:390388:�
870:390803:�
870:537910:�
870:538132:�
870:569811:�
870:598916:�
870:673324:�
870:673971:�
870:674386:�
870:821493:�
870:821715:�
870:853440:�
871:882578:�
871:956734:�
871:957381:�
871:957796:�
871:1104903:�
871:1105125:�
871:1136804:�

grep non-ascii characters in hex:

Command:

# Split the output of grep on ':', print the first 2 tokens, and pass the 3rd through xxd to convert the byte to hex
LC_ALL=C grep --color='auto' -obnP  "[\x80-\xFF]" file.xml |\
xargs -I{} bash -c "echo {}|awk 'BEGIN { FS = \":\" };{printf \"%s:%s:\",\$1, \$2; print \$3 | \"xxd -p -l1\" }'"

Output:

868:31879:96
868:106287:92
868:106934:92
868:107349:92
868:254456:92
868:254678:92
868:286403:92
870:315585:96
870:389741:92
870:390388:92
870:390803:92
870:537910:92
870:538132:92
870:569811:92
870:598916:96
870:673324:92
870:673971:92
870:674386:92
870:821493:92
870:821715:92
870:853440:92
871:882578:96
871:956734:92
871:957381:92
871:957796:92
871:1104903:92
871:1105125:92
871:1136804:92
Marinos An
  • 9,481
  • 6
  • 63
  • 96
0

UPDATE 1 : changing main awk code from 9 to NF to handle leading and trailing edge ASCIIs


Keep it simple with awk - leverage RS for hands-free driving - no locale adjustments required :

__=$'123=pqr:\303\606?\414#45&6\360\641\266\666]>^{(\13xyz'

printf '%s' "$__" | od
0000000        1026765361       980578672       205489859       641020963
           1   2   3   =   p   q   r   :   Æ  **   ?  \f   #   4   5   &
          061 062 063 075 160 161 162 072 303 206 077 014 043 064 065 046
           1   2   3   =   p   q   r   :   ?  86   ?  ff   #   4   5   &
           49  50  51  61 112 113 114  58 195 134  63  12  35  52  53  38
           31  32  33  3d  70  71  72  3a  c3  86  3f  0c  23  34  35  26

0000020        3064066102      1581145526      2013997179           31353
           6    **  **  **   ]   >   ^   {   (  \v   x   y   z        
          066 360 241 266 266 135 076 136 173 050 013 170 171 172        
           6   ?   ?   ?   ?   ]   >   ^   {   (  vt   x   y   z        
           54 240 161 182 182  93  62  94 123  40  11 120 121 122        
           36  f0  a1  b6  b6  5d  3e  5e  7b  28  0b  78  79  7a        

0000036
printf '%s' "$__"
123=pqr:Æ?
          #45&6]>^{(
                      xyz
mawk NF RS='[\0-\577]+' | gcat -b
 1  Æ
 2  

Set a custom ORS for single-line output :

gawk NF RS='[\0-\577]+' ORS='|' | gcat -b
Æ||

If you insist on using nawk, then you need to modify the RS to ...

nawk NF RS='(\\0|[\1-\177]+)+'

... since nawk has issues handling either \0 or \\0 within a char class, it must be taken out of [...] and be replaced with a disturbingly verbose alternation

RARE Kpop Manifesto
  • 2,453
  • 3
  • 11
0

ripgrep (rg)

rg '[^[:ascii:]]'  # matches any line containing a non-ASCII character

brew install ripgrep, also on Linux.

Maybe I'm missing something, but I found this the easiest and fastest alternative.

Pablo Bianchi
  • 1,824
  • 1
  • 26
  • 30
0
nawk    '/[\200-\377]/'
mawk    '/[\200-\377]/'

gawk -b '/[\200-\377]/'
gawk -e '!/^[\0-\177]*$/'

In gawk's Unicode mode, just doing /[^\0-\177]/ is insufficient because it misses all the poorly-formed sequences and/or arbitrary bytes like \371.

otherwise, you have to list all 128 bytes out in alternation form, and it's hideous

RARE Kpop Manifesto
  • 2,453
  • 3
  • 11