4

I have a large list of words in a text file (one word per line). Some words have accented characters (diacritics). How can I use grep to display only the lines that contain accented characters?

R OMS
  • That returns this error: usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]] [-e pattern] [-f file] [--binary-files=value] [--color=when] [--context[=num]] [--directories=action] [--label] [--line-buffered] [--null] [pattern] [file ...] – R OMS Oct 18 '17 at 16:51
  • Looking at [this](https://stackoverflow.com/questions/20690499/concrete-javascript-regex-for-accented-characters-diacritics) answer, you might just match a range of unicode characters `[\u00C0-\u017F]` – Mako212 Oct 18 '17 at 16:52

2 Answers

3

The best solution I have found, which covers a larger class of characters ("Which words are not pure ASCII?"), is to use PCRE via the -P option:

grep -P "[\x7f-\xff]" filename

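This will find accented characters in UTF-8 and in the single-byte encodings ISO-8859-1 and ISO-8859-15 (Latin1, win1252, cp850) alike, because all of them encode accented characters with bytes above the ASCII range.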
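As a quick check of the command above, here is a minimal sketch run against a throwaway word list. The file name `words.txt` and the sample words are assumptions made for illustration, and the `LC_ALL=C` prefix is borrowed from the comment below rather than from the answer itself. It also assumes a grep that supports `-P` (GNU grep built with PCRE).

```shell
# Build a small sample word list (hypothetical file name).
# \303\251 and \303\257 are the UTF-8 bytes of é and ï.
printf 'cafe\ncaf\303\251\nna\303\257ve\nplain\n' > words.txt

# LC_ALL=C makes grep treat the input as raw bytes,
# so the byte range in the pattern is applied literally.
matches=$(LC_ALL=C grep -P '[\x7f-\xff]' words.txt)

# Print only the lines that contain a byte in the range 0x7f-0xff.
printf '%s\n' "$matches"
```

Only the two accented words should be printed; the pure-ASCII lines `cafe` and `plain` are filtered out.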

LSerni
  • A side note: in the ASCII table, A-Z and a-z are represented by \x41-\x5a and \x61-\x7a respectively. So `grep -Po "[^\x41-\x5a\x61-\x7a]"`, which matches every character except A-Z and a-z, also works. – Culip Aug 10 '20 at 18:57
  • Beware that this sometimes doesn't work with file arguments (even though it does when grep is reading from stdin!?). You may need to add `LC_ALL=C` in front to make it work: `LC_ALL=C grep -P "[\x7f-\xff]" filename` – mivk Nov 21 '21 at 18:56
1

I have a solution. First strip the accents using "iconv", then "diff" the result against the original file and keep only the lines that come from the original:

iconv -f utf8 -t ascii//TRANSLIT text-file > noaccents-file
diff text-file noaccents-file | grep '<'
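The two steps above can be tried end to end with a small sample file. This is only a sketch: the sample words are invented, the `cut` step that strips diff's `< ` prefix is an addition not in the original answer, and the characters that `//TRANSLIT` substitutes differ between iconv implementations (some emit `e`, others `?`). Either way the transliterated line differs from the original, which is all the diff relies on.

```shell
# Sample input (hypothetical contents); \303\251 and \303\257 are the UTF-8 bytes of é and ï.
printf 'cafe\ncaf\303\251\nna\303\257ve\nplain\n' > text-file

# Transliterate to ASCII; the replacement characters vary between iconv implementations.
iconv -f UTF-8 -t ASCII//TRANSLIT text-file > noaccents-file

# Lines marked '<' exist only in the original file; drop the two-character "< " prefix.
accented=$(diff text-file noaccents-file | grep '^<' | cut -c3-)

printf '%s\n' "$accented"
```

Anchoring the pattern with `^<` avoids picking up lines that merely contain a `<` somewhere in the word.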
R OMS